Sharable analysis artifact repositories #6

hafen · 2015-02-10T22:54:51Z

I can see the scope of this blowing up pretty quickly, but here's one idea I've been thinking about recently that seems applicable to rOpenSci that might be worth discussing at the unconf.

When analyzing data we do a lot of data munging and make a lot of plots. It's an iterative process, and much of what we do is not worth keeping track of formally, but when we find useful visualizations or derive data sets that will become the basis for further analysis, we want to be able to keep track of these things in an organized way. Further, we often want or need to share these easily with others who are also involved in the analysis. This is particularly a problem when dealing with large data sets or very long or complicated steps in the analysis process, or when the types of artifacts are so varied that we can't organize things in dirctories or in our R workspace.

I have heard these intermediate visualizations and data referred to as analysis "artifacts", which I like because of its definition:

artifact: any object made by human beings, especially with a view to subsequent use.

We talk a lot about reproducible research, reproducible software environments, open data, sharable code through packages, but I think analysis artifacts should be first-class citizens in this discussion as well.

To keep it simple, I'm thinking it would be worth exploring the idea of building a lightweight R-based extensible artifact "registry" with an API to plug in different data types (text files, rda, database, etc.) and their associated back ends or locations (s3, server, etc.) or visualizations / applications (shiny, ggplot, ggvis, htmlwidgets) or even documents (rmarkdown, etc.). The registry would basically provide an organization of the different artifacts attributes / meta data (description, tags, version, date, access, etc.), and ideally a simple visual interface to see them all.

An artifact repository plugin would have to have methods for at least the following:

add the artifact to the registry
access the artifact from the registry
embed or render the artifact in an rmarkdown document
etc.

The other question is how to share a registry. Github for code is great, but github for sharing artifacts is terrible (versioning binary objects, etc.). But if an artifact registry is simply a set of properties that describe where and what things are, Github could be a good idea. Also sharing and serving a visual registry explorer with shiny server / shinyapps is an interesting idea.

There are a lot of related efforts and it could get complicated quickly, but it would be cool to discuss these ideas if people are interested.

cboettig · 2015-02-10T23:59:28Z

Great suggestion! Would you consider the related issue of reproducing these artifacts as in scope here as well? (e.g. running the R code that generates them vs simply sharing the output). I think it would be useful to motivate this issue with a focal example of a particular set of artifacts we would like to share, and the challenges that arise.

In particular, the use of Docker Hub for sharing / discovering / distributing binaries, as a complement to Github, might be of interest here; work with @eddelbuettel & myself on rocker might provide a starting point for this.

You may also be interested in this special issue of Operating Systems Review on Repeatability and Sharing of Experimental Artifacts (Disclaimer that I contributed an article to this issue)

gmbecker · 2015-02-13T15:00:08Z

@hafen, @cboettig I think this is a great idea. Michael Lawrence and I have been thinking about this recently, though from a slightly different angle. Our department includes a lot analysts (~40 maybe?) each of which generate a lot of artifacts on varied but potentially overlapping analyses. As such, we've been viewing a system like @hafen describes as a tool for discoverability; both for a single analyst within his/her body of work ("I KNOW I have that plot somewhere...") and between analysts ( "has anyone already look at those genes in these samples? What kinds of plots did they make?").

One fun aspect of this is trying to automagically generate the metadata for artifacts from the code which generated them and/or the artifacts themselves. I'm actually approved for an intern this summer to look into that, but I doubt the problem will be fully solved by him or her in a summer.

We'd be very interested in a long-term collaboration on something like this. For the record, we generally release the software (R packages) we work on (Michael Lawrence and myself, at least) under OSS licenses I think are acceptable to ROpenSci policies (Artistic-2.0 usually), so AFAIK a collaboration with ROpenSci or members thereof wouldn't be a problem from our side.

hafen · 2015-03-20T22:08:08Z

@cboettig, @gmbecker, sorry I'm so slow to respond on this.

@cboettig, thanks for the link to the special issue. I'll take a look. I'd like to discuss the use of Docker with you to get a better understanding of how it might be used to register / store / share binary objects - I've only used it for DevOps type purposes (maybe I'll find the answers when I read your paper). It sounds like this might not be a mainstream topic at the unconf but I'd like to find some time to discuss.

@gmbecker, I'd love to discuss your thoughts on what you guys are looking for. I have a project where having something like this would be nice so I've been doing some brainstorming. Since something like this is at high risk for feature creep, it'd be good to stick to the idea of doing a few things well, without imposing much on the user in terms of having to install weird things or do anything too far off from what they are used to doing. Automatically generating metadata is a cool idea.

richfitz mentioned this issue Feb 12, 2015

Packages as research repositories/compendia #11

Open

cboettig mentioned this issue Mar 19, 2015

The R package as the unit of reproducible research #31

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sharable analysis artifact repositories #6

Sharable analysis artifact repositories #6

hafen commented Feb 10, 2015

cboettig commented Feb 10, 2015

gmbecker commented Feb 13, 2015

hafen commented Mar 20, 2015

Sharable analysis artifact repositories #6

Sharable analysis artifact repositories #6

Comments

hafen commented Feb 10, 2015

cboettig commented Feb 10, 2015

gmbecker commented Feb 13, 2015

hafen commented Mar 20, 2015