Thoughts and questions on a first thorough review #18
On Mon, Feb 22, 2016 at 9:28 PM C. Titus Brown notifications@github.com wrote:
(There is also some work by guys from IBM to import notebooks into …)
I see two things to focus on: educational material (how to make these …)
On the last point - how about connecting with GigaScience, or GitXiv? And we can write in connections to Software and Data Carpentry, although serious lesson development would be out of scope and budget for the first round, I think.
I have GigaScience or PeerJ (or PLOS?) on my list of publishers that should be up for this, but unfortunately I have no contacts there whatsoever. Do you know someone who could make an introduction? I could potentially be introduced to someone from the arXiv, but they seem extremely busy and hence quite conservative towards brand-new ideas. Conclusion: focus on the web app/tools part, with lessons and education as nice-to-haves. (Please disagree if you do.)
I'm on the editorial board at GigaScience and PeerJ CS.
I know somebody at Springer Nature - I think they own GigaScience, at least the last time I checked. I can try to contact them as soon as there is a first distributable version.
O'Reilly also has some connections to PeerJ -- who would you be trying to reach there?
Another idea we are working on is a science hackathon: I would love to make it possible for papers written there to try the "dynamic & interactive" way - very early proposal: https://docs.google.com/document/d/1HwiQxyVG1CnW6AUbFQ-0yMT-BYHi7MnVzXCKSId-xXg
@odewahn not sure who you'd want to contact. Educated guess: editors. They are scientists themselves, so they could be excited by the prospect; if they like the idea, they can champion it within the publisher.
Oh, and we could talk to bioRxiv, too. I don't think there's a shortage of publishers that would be interested.

But this leads in another interesting direction - one of the big concerns I see from the perspective of publishers and librarians is that the technology and formats are changing very fast, so it's not at all clear that in (e.g.) 5 years we will be able to run today's Jupyter Notebooks inside of Docker. Perhaps part of our proposal could focus on doing something about that in the next year - it's probably too early to build standards, but defining the minimal ingredients could be useful at this point.

While I'm randomly brainstorming, any thoughts on bringing the R community's technology (a community which is QUITE large in bio and biostats) into the fold here? I have some experience with RStudio and RMarkdown, less with Shiny.
P.S. I can broker introductions with many journal editors. We should figure out what we want to say rather than worrying too much about who to say it to :)
(I'll summarize all of these at the end, but while I'm on a roll ;) The integration with Travis CI and other continuous integration services is particularly nice with pull requests. One thing that I have yet to see is the integration of continuous integration and pull requests into paper pipelines - this could be valuable for both collaboration and review.
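As a purely illustrative sketch (nothing in this thread defines it), CI on a paper repository could be as small as the config below: every pull request rebuilds the environment and re-runs the pipeline, and a failed build fails the pull request. The image name and the snakemake-driven build are assumptions, not something specified in this project:

```yaml
# .travis.yml -- hypothetical CI setup for a paper repository
language: python
python:
  - "3.5"
sudo: required
services:
  - docker
install:
  # build the environment the paper declares (e.g. from its Dockerfile)
  - docker build -t paper-env .
script:
  # re-run the whole pipeline; a non-zero exit code fails the pull request
  - docker run -v $PWD:/work -w /work paper-env snakemake --cores 2
```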
Is it realistic to get one of the publishers to "endorse" this proposal on the timescale of Feb 28th? It would for sure make the proposal stronger. How quickly they could make a decision and a public statement probably relates to what we ask from them. We should discuss this in #22, or, if we think that even for the minimal ask they won't be able to converge before we submit, I would punt this to after submission. Re: minimal ingredients, in my world we use a … Having the flexibility of a …
What is a paper pipeline for you, @ctb? For me: a git repository that contains all the code required to produce a paper, as well as a … The workflow then goes something like this: …
To share the latest PDF of the paper we point people at …
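The workflow steps above were lost from this thread, but as a rough, hedged sketch (with hypothetical file names), a paper pipeline of this kind often boils down to a couple of build rules - here written for snakemake, one of the tools mentioned later in the discussion:

```
# Snakefile -- hypothetical top level of a paper pipeline
rule all:
    input:
        "paper.pdf"

# regenerate a figure from the analysis code and the raw data
rule figures:
    input:
        script="analysis/run_analysis.py",
        data="data/raw.csv"
    output:
        "figures/results.png"
    shell:
        "python {input.script} --out {output}"

# build the PDF from the manuscript source and the regenerated figure
rule paper:
    input:
        manuscript="paper.md",
        figs="figures/results.png"
    output:
        "paper.pdf"
    shell:
        "pandoc {input.manuscript} -o {output}"
```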
Gist exec handles R Markdown in the most basic of ways. |
Agreed on the definition, provisionally :)

What about building a simple specfile that automates the bit of mybinder where you have to tell it whether to look at requirements.txt or a Dockerfile, and expanding it to specify whether we should run RMarkdown, Jupyter, or blah, and make, snakemake, or pydoit?

More generally (and this is not well thought out), how about working towards a base Docker image that contains all the relevant software installs, and combining that with a specfile that says "here is what to run, here is our guesstimate of the compute resources required, and here is where the interesting output will reside - data files, PDF, etc."? And then implementing that?

--titus
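To make the idea concrete, here is one possible shape such a specfile could take. This is purely a sketch; none of the field names below are defined anywhere yet - they are assumptions about what the minimal ingredients might be:

```yaml
# paper-spec.yml -- hypothetical executable-paper specfile
environment:
  # how to build the compute environment: a Dockerfile, requirements.txt, or conda env
  type: dockerfile
  path: Dockerfile
run:
  # which narrative engine and which build tool to drive
  engine: jupyter        # or: rmarkdown
  builder: snakemake     # or: make, pydoit
  entrypoint: paper.ipynb
resources:
  # rough guess of what is needed to re-run everything
  cpus: 4
  memory: 8GB
  walltime: 2h
outputs:
  # where the interesting results end up
  - paper.pdf
  - figures/
  - data/processed/
```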
Crossing one thing off my list as done: …
Follow-on to the previous comment - this specfile could then be used in the composition of workflows. I think the idea of (specfile + demo implementation + exploring composition) could be a nice, circumscribed proposal for the Open Science Prize. Thoughts?
That gets quite close to http://bioboxes.org/, no? Some thoughts on this in #16.
Same idea as bioboxes, different intent and interface ;)
Regarding composability of notebooks, it can be done with a "master" notebook (the main narrative) calling other notebooks, optionally passing parameters. There is a tiny function I wrote for the purpose: https://github.com/tritemio/nbrun, and a more advanced implementation from @takluyver: https://github.com/takluyver/nbparameterise. So paper.ipynb can optionally be a master notebook executing other notebooks for the various macro-steps of the analysis. This more or less solves the dependency problem regarding the notebooks.

For software dependencies, I think specifications of "conda environments" (including the version of each package) can help us rebuild the "software environment" in the years to come (assuming Continuum does not delete the packages of old software from their archives, which is unlikely). Conda covers Python and R packages as well as other basic libraries. Also, environment specifications are purely declarative YAML files (as @ctb suggested). I think using a conda environment inside Docker would be a great solution.
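For reference, a pinned conda environment specification is just a short declarative YAML file like the one below (the package names and versions are only an example, not taken from this project):

```yaml
# environment.yml -- example of a declarative, pinned conda environment
name: paper-env
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.5.1
  - numpy=1.10.4
  - matplotlib=1.5.1
  - jupyter=1.0.0
  - r-base=3.2.3
```

Recreating the environment later is then a single command: `conda env create -f environment.yml`.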
Some thoughts concerning composability, which is actually the core issue of this project. There are three points of view concerning composition: science, communication, and technology.

In terms of science, an executable paper is composed of ingredients such as models, methods, experimental data, fitted parameters, etc. The details very much depend on the kind of science one is doing. Reusability requires that each ingredient can be replaced by a different one easily. In terms of communication, an executable paper is composed of new material and prior art, to which the new material refers. In terms of technology, we have to deal with the huge mess that we have piled up over a few decades. A ready-to-execute paper is composed of an operating system, compilers, linkers, interpreters, containers, servers, databases, individual datasets, libraries, middleware, software source code, and of course explanations for human readers. Maybe I have forgotten something.

The challenge is to align these different points of view in order to get something usable. We need to compose technological artefacts in such a way that we can communicate the science in a way that is understandable and reusable. That is, in my opinion, the ultimate goal of this project. Ideally, we would have a single kind of technological artefact that is inherently composable. Procedures in a programming language are such artefacts: we can make a procedure that calls a few already existing procedures. Dynamic libraries are also composable: we can make a dynamic library that calls code from a few other dynamic libraries. Binary executables are composable with more effort: we need to write glue code in order to produce a binary executable that calls other binary executables. To compose different kinds of artefacts into a whole, we have to do messy interfacing work. Most of the hard problems in computing are related to composing artefacts that were not designed for being composed: packaging, portability, deployment, dependency hell, DLL hell, software rot, and many more. Composition is the #1 source of accidental complexity.

Now let's look at the technologies mentioned here from the point of view of composability. As far as I know, Docker containers are not composable, though I may be wrong. It doesn't sound impossible in principle to make a container out of three existing containers, but I haven't seen it done. If containers are not composable, there can only be one container in an executable paper. BTW, there is an alternative approach that is composable: packages as defined by Nix or Guix (two implementations of the same concept). Much more promising than containers, in my opinion. Also less popular, because less convenient for software deployment. But our problem is different from software deployment.

Notebooks are not composable. You cannot combine two notebooks into a larger notebook, nor into any other useful entity. More importantly, you cannot call code in one notebook from another notebook. That means that notebooks are not reusable either. At best, reuse means that only a small part of a big notebook must be modified in order to do a different computation. Mybinder or Everware compose an environment implemented as a container with a collection of independent notebooks into a publishable package. That package is not composable with anything else. On the other hand, this composition aligns very well with the communication aspect: the environment contains the prior art, and the notebooks contain the new stuff.

Moreover, it's acceptable that the prior art is not so explorable by the user, as it has presumably been published and explained before. That leaves the question of how to package the "new stuff" in such a way that its individual scientific components are (1) reusable and (2) explained to human readers. Software libraries offer (1) but not (2), and are restricted to code. Notebooks offer (2) but not (1). They can contain code and small datasets. Independent datasets would be a straightforward addition, so data isn't really the problem. Traditional literate programming, as introduced by Knuth, looks like a promising way to integrate code with a human-readable explanation of the science, in a composable way. Unfortunately, it doesn't compose with notebooks into a coherent human-readable document.

In summary, what this project really is about is to compose different technologies in such a way that they permit the construction of executable papers by composition of reusable components.
@tritemio Nbrun looks interesting. Can you compose notebooks recursively using this technique? In other words, can you treat a notebook like a procedure that can call other procedures?
@khinsen we were probably writing our comments at the same time. I agree with your analysis. For me conda covers most use cases - what's your take on that? Also, a simple form of notebook composability is possible with the concept of a "master" notebook and "template" notebooks (see the nbrun link) that act like functions. It is not as flexible and general as calling a real function, but for the macro-steps of the analysis, with few parameters, it works fairly well (and you have links to go back and forth between master and template notebooks if you want to dive into the details). As an example, I recently used the following pipeline: …
Notebooks are inter-linked for easy navigation. @khinsen, to answer your last question: yes, this procedure can be repeated (a template notebook can call other notebooks with or without parameters).
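For readers unfamiliar with the pattern, here is a rough sketch of what a master notebook does under the hood. This is not nbrun's actual API - just the same idea expressed directly with nbformat and nbconvert, with hypothetical notebook names and parameters:

```python
# Sketch of the "master notebook" pattern: execute a template notebook
# with injected parameters and save the executed copy alongside it.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

def run_template(template_path, out_path, params):
    nb = nbformat.read(template_path, as_version=4)
    # inject a parameter cell at the top of the template notebook
    param_code = "\n".join("%s = %r" % (k, v) for k, v in params.items())
    nb.cells.insert(0, nbformat.v4.new_code_cell(param_code))
    # execute all cells in a fresh kernel, then save the executed notebook
    ExecutePreprocessor(timeout=3600).preprocess(nb, {"metadata": {"path": "."}})
    nbformat.write(nb, out_path)

# hypothetical macro-steps of an analysis, each one a template notebook
run_template("preprocessing.ipynb", "preprocessing-run.ipynb", {"dataset": "sample1.csv"})
run_template("fitting.ipynb", "fitting-run.ipynb", {"model": "2-states"})
```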
@khinsen In principle you can build a complex dependency "graph", but when you use multiple layers you cannot easily "see" the full dependency graph by looking only at the master notebook (just as when you call a function you don't know how many sub-functions are also called).
@tritemio Conda is fine for what it contains. For many Python-based projects it's probably good enough. But if you don't use Python, or if you need to compile your own extension modules, then conda starts to be as much of a problem as it is a help - in particular on Mac OS X, where you need a very peculiar Apple SDK installation if you want to link to libraries supplied by conda. Euhh... I just noticed that you wrote "conda" but not "anaconda". Conda on its own is just a build and deployment tool. I wouldn't want to package all my software from scratch using conda!
On Tue, Feb 23, 2016 at 7:54 PM Konrad Hinsen notifications@github.com wrote:
So the fact that you have to have a human read each Dockerfile, think about …
(Below I use notebook as a placeholder for any narrative+code document, …) In my experience only those with a wish for insanity create notebooks … So I think of the … I think finding the right balance for a complicated problem like this will … I can think of several use cases from LHC which won't work with what is …
@tritemio I am having second thoughts about nbrun. You use the terms "template" and "macro", so I wonder if nbrun runs sub-notebooks in a separate namespace. If not, then that's not proper composition, because there is no well-defined interface between the components - a dangerous source of bugs.

@betatim I fully agree that only experience will tell what works and what doesn't. But it does help to do some brainstorming about possible difficulties in advance. The only point on which I disagree with what you say is that a documented library is good enough as an explanation of a new model or method in an executable paper. Library documentation is reference-style, organized around the code. It explains how the code does something, but it doesn't explain the motivation for doing things, nor the concepts required for understanding new science. You could of course add such material to library documentation, but that's not where it belongs. It belongs in a narrative specifically written for explaining things. That was Knuth's idea with literate programming. A traditional paper has a section "materials and methods" and a section "results". They belong together and reference each other. It's no good to have "materials and methods" in library documentation and "results" in notebooks. That's a bit like a traditional paper saying that "a description of the methods is available from the authors upon request" - a barrier between methods and results that prevents understanding.
@betatim The question of how to divide the information into files should probably be left to experimentation, and even remain flexible in the long run, to accommodate as many tools and habits as possible. There are various literate programming tools out there, but there are also people who prefer code and comments in separate files. What matters to me is that our tools should not discourage us from writing good explanations - as you say, the hard part is convincing people to actually do it.
On Wed, Feb 24, 2016 at 12:32 AM, Konrad Hinsen notifications@github.com wrote:
Notebooks will never be (at least not easily) as composable as functions. We should use/promote the right abstractions, and at this point I would …
http://cdn.emgn.com/wp-content/uploads/2015/07/Inception-Facts-EMGN3.gif notebooks in notebooks in notebooks in notebooks
Digital Science is another group you could try reaching out to. They run Figshare, Overleaf, and LabGuru - all devoted to opening up science workflows and outputs in various forms.
@m3gan0 good idea!
I know people there, but I don't think we should ask them for an expression of interest at this point - just mention them as part of the ecosystem we hope to work with. protocols.io is another one (we could probably get an expression of interest from Lenny Teytelman quite quickly, actually).
For continuous integration, we need some indication of success to be built in. Is that a zero exit code, or can we put in assertions of some sort? (See the sketch below.)
Konrad Hinsen clearly has some thoughts on composability
We shouldn't tie things to mounting local directories because they don't work with most docker-machine types (see my approach with data volumes). For a demo or prototype, of course, it's OK :)
I really like this concept for some reason: "web based way to create an environment, try it and then download it".
Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.
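As a hedged illustration of the "assertions" option above: CI could run a small check script after the pipeline finishes, so that success means more than "the build exited with code zero". The file names and thresholds below are made up for the example:

```python
# check_paper.py -- hypothetical post-build sanity checks for a paper pipeline.
# Run by CI after the pipeline; any failed assertion gives a non-zero exit code.
import os

def main():
    # the paper itself must have been produced and be non-trivially sized
    assert os.path.exists("paper.pdf"), "paper.pdf was not produced"
    assert os.path.getsize("paper.pdf") > 10000, "paper.pdf looks suspiciously small"

    # a key numerical result can be pinned down to a tolerance
    with open("results/fit_value.txt") as f:
        fit = float(f.read())
    assert abs(fit - 0.42) < 0.01, "fitted value drifted from the published one"

    print("all checks passed")

if __name__ == "__main__":
    main()
```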