Choose a reproducibility platform/workflow #188

Closed
brainwane opened this issue Aug 4, 2018 · 5 comments

@brainwane
Contributor

Right now, to store and share replications and reproductions, we use a special-purpose repository called "REMARK: Replications and Explorations Made using the ARK". We're investigating whether we should stay with that or choose a different approach.

@llorracc has asked our NumFOCUS peers what they're using and recommending, and I've asked peers within The Carpentries the same question.

Let's share notes in this GitHub thread and make the decision here.

@llorracc
Collaborator

llorracc commented Aug 4, 2018

Good idea to use a GitHub thread for this -- since GitHub itself is likely to be an integral part of whatever the answer will prove to be.

One other point is that we will need not only to choose a technology, but also to:

  1. Create some working examples of the use of that technology
  2. Create materials for teaching students to use the technology
    • This may mostly involve a curated set of links to existing materials

@llorracc
Collaborator

llorracc commented Aug 4, 2018

So far (2018-08-04) the NumFOCUS discussion has surfaced two candidates that we had not found before:

Sumatra

The mission statement is spot-on for what we are trying to achieve. But we are not sure how active the project is, or whether it aims to expand to fields other than its home in neuroscience.

Datalad

Datalad’s aim is to record virtually every step in a whole neuroscience research project (code, data, writing, etc.). For a team of people with a complete grasp of Datalad and a lot of experience with it, this sounds like it might be a great tool. It could be adapted to our narrower purposes, but how easily is unclear.

We've also looked at less specialized tools including:

  1. RunMyCode.org
  2. Dataverse.org
  3. VisTrails
  4. Stencila
  5. ReproZip

but in our quick assessment none of them appeared to offer the combination we are seeking of being:

  1. Lightweight (doesn't ask too much of the authors)
  2. Future-proof (if the project dies, the archive should still be usable)
  3. Focused on our goal
    • Not just archiving code, but ensuring that it runs and will continue to be runnable indefinitely

@VickyRampin

VickyRampin commented Aug 5, 2018

I believe ReproZip is on-mission for you! It takes two steps for folks to bundle their work with ReproZip, and ReproZip captures not only all data and code but also the environment in which everything runs. ReproZip gets everything needed to rerun someone's work and creates a single distributable bundle of it. You can see examples at: https://examples.reprozip.org
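
A minimal sketch of those two steps (trace, then pack), assuming the standard ReproZip command-line tools are installed; the script name run_analysis.py and the bundle name my_experiment.rpz are illustrative:

```python
import subprocess

# Step 1: trace a run of the analysis. ReproZip observes the command and
# records the files, libraries, and environment details it touches
# (stored under .reprozip-trace/).
subprocess.run(["reprozip", "trace", "python", "run_analysis.py"], check=True)

# Step 2: pack the trace into a single distributable .rpz bundle.
subprocess.run(["reprozip", "pack", "my_experiment.rpz"], check=True)
```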

I actually wrote a paper about how, because the ReproZip format is generalizable, it works well for archival purposes (I'm a librarian trained in digital preservation): https://iassistquarterly.com/index.php/iassist/article/view/18

ReproZip has unpackers (Docker, Vagrant), but doesn't rely on them to rerun a bundle. We are actually adding Singularity as an available unpacker now!
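
And a matching sketch of re-executing a bundle with one of those unpackers (Docker here); the bundle and directory names are illustrative:

```python
import subprocess

# Unpack the bundle into a working directory backed by a Docker image.
subprocess.run(
    ["reprounzip", "docker", "setup", "my_experiment.rpz", "unpacked/"],
    check=True,
)

# Re-run the originally traced command inside that reconstructed environment.
subprocess.run(["reprounzip", "docker", "run", "unpacked/"], check=True)
```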

Anyway, sorry to butt into your issue/convo, just wanted to add some additional info!

Edited to add: we also have a way to unpack and interact with ReproZip bundles in the cloud, a tool called ReproServer. In an author-reviewer scenario, the author uses ReproZip to make a bundle of all the dependencies and the workflow needed to rerun their work correctly in the original environment in which they worked. The author can then either put their RPZ file in a repository (OSF, figshare, etc.) or send the RPZ file in with their paper. If the author sends the RPZ file, the reviewer can upload it to ReproServer and rerun it. If the author uploads the RPZ to a repository, they just need to share the link; the reviewer can pass that link to ReproServer, which will fetch the bundle and re-execute the work inside it.

@mnwhite
Contributor

mnwhite commented Sep 4, 2018

That sounds promising.

@shaunagm
Contributor

Closing this: it's closely related to Overark's issue #5, so we can discuss it there.
