
Thoughts and questions on a first thorough review #18

Closed
ctb opened this issue Feb 22, 2016 · 35 comments

Comments

@ctb
Member

ctb commented Feb 22, 2016

For continuous integration, we need to build in some indication of what success means. Is that "zero exit code", or can we put in assertions of some sort?

Konrad Hinsen clearly has some thoughts on composability

We shouldn't tie things to mounting local directories, because that doesn't work with most docker-machine types (see my approach with data volumes). For a demo or prototype, of course it's OK :)

I really like this concept for some reason: "web based way to create an environment, try it and then download it".

Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.

@betatim
Member

betatim commented Feb 22, 2016

On Mon, Feb 22, 2016 at 9:28 PM C. Titus Brown notifications@github.com wrote:

> For continuous integration, we need some indication of what success is to be built in. Is that "zero exit code" or can we put in assertions of some sort?

We should start with a script that runs paper.ipynb and tells you whether it was a success or not. Maybe it could also run the other notebooks found in the top-level directory?
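A rough sketch of what such a check script could look like (the notebook name is a placeholder; this assumes `jupyter nbconvert` is available, and simply treats a zero exit code as success):

```python
import subprocess


def verdict(returncode):
    """Map a process exit code to a CI verdict string."""
    return "success" if returncode == 0 else "failure"


def run_notebook(path):
    """Execute a notebook via nbconvert and return the exit code.

    nbconvert re-runs every cell; a cell raising an exception makes
    the process exit non-zero, which is what CI keys off.
    """
    result = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook",
         "--execute", "--stdout", path],
        stdout=subprocess.DEVNULL,
    )
    return result.returncode

# usage (hypothetical notebook name):
#   print(verdict(run_notebook("paper.ipynb")))
```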

> Konrad Hinsen clearly has some thoughts on composability
> http://ivory.idyll.org/blog/2016-mybinder.html#comment-2520035392

I think a lot of people think "notebooks for everything" when you mention them. I think it should be more of a plumbing vs. porcelain approach. The notebook explains the how, the why, and the story driving your analysis. Large parts of that analysis code will be in big libraries (http://root.cern.ch, numpy, and friends) or in 'library' code for your analysis, but the story can be written down in a notebook with the few salient function calls.

(There is also some work by guys from IBM to import notebooks into notebooks; not sure yet what I think of that.)

> We shouldn't tie things to mounting local directories because they don't work with most docker-machine types (see my approach with data volumes http://ivory.idyll.org/blog/2015-transcriptomes-with-docker.html). For a demo or prototype, of course it's ok :)

Did we write "mount locally" somewhere? The only use case I can think of is when I run the analysis on my laptop/desktop. I'd like to mount the git repo into the container so I can edit it from the outside (with my favourite emacs).

> I really like this concept for some reason: "web based way to create an environment, try it and then download it".

> Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.

👍

I see two things to focus on: educational material (how to make these executable papers) or building the infrastructure needed to host a 'journal' which shows them off. Despite thinking that the education part is the bigger problem, my naive guess would be that the infrastructure part is more grant-y.

@ctb
Member Author

ctb commented Feb 22, 2016

On the last point - how about connecting with GigaScience, or GitXiv? And we can write in connections to Software and Data Carpentry, although serious lesson dev would be out of scope and budget for the first round, I think.

@betatim
Member

betatim commented Feb 22, 2016

I have GigaScience or PeerJ (or PLOS??) on my list of publishers that should be up for this, but I unfortunately have no contacts there whatsoever. Do you know someone who could make an introduction?

I could potentially be introduced to someone from the arXiv, but they seem extremely busy and hence quite conservative towards new-new ideas.

Conclusion: focus on the web app/tools part. Lessons and education as nice-to-haves. (Please disagree if you do.)

@ctb
Member Author

ctb commented Feb 22, 2016

I'm on Ed board at GigaScience and PeerJ CS.

@JackDapid
Contributor

I know somebody from Springer Nature - I think they own GigaScience, at least last time I checked. I can try to contact them as soon as there is a first distributable version.

@odewahn
Contributor

odewahn commented Feb 22, 2016

O'Reilly also has some connections to PeerJ -- who would you be trying to reach there?

@JackDapid
Contributor

Another idea we are working on is a science hackathon: I would love to make it possible for papers written there to try the "dynamic & interactive" way. Very early proposal: https://docs.google.com/document/d/1HwiQxyVG1CnW6AUbFQ-0yMT-BYHi7MnVzXCKSId-xXg

@betatim
Member

betatim commented Feb 23, 2016

@odewahn not sure who you'd want to contact. Educated guess: editors. They are scientists themselves, so they could be excited by the prospect; if they like the idea, they can champion it within the publisher.

@ctb
Member Author

ctb commented Feb 23, 2016

Oh, and we could talk to biorxiv, too. I don't think there's a shortage of publishers that would be interested.

But, this leads in another interesting direction - one of the big concerns I see from the perspective of publishers and librarians is that the technology and formats are changing very fast, so it's not at all clear that in (eg) 5 years we will be able to run today's Jupyter Notebooks inside of Docker. Perhaps part of our proposal could focus on doing something about that in the next year - it's probably too early to build standards, but defining the minimal ingredients could be useful at this point.

While I'm randomly brainstorming, any thoughts on bringing the R community (which is QUITE large in bio and biostats) technology into the fold here? I have some experience with RStudio and RMarkdown, less with Shiny.

@ctb
Member Author

ctb commented Feb 23, 2016

p.s. I can broker introductions with many journal editors. We should figure out what we want to say rather than worrying too much about who to say it to :)

@ctb
Member Author

ctb commented Feb 23, 2016

(I'll summarize all of these at the end of this, but while I'm on a roll ;)

The integration with TravisCI and other continuous integration services is particularly nice with pull requests. One thing that I have yet to see is integration of continuous integration & pull requests for paper pipelines - this could be valuable for both collaboration and review.

@betatim
Member

betatim commented Feb 23, 2016

Is it realistic to get one of the publishers to "endorse" this proposal on the timescale of Feb 28th? It would for sure make the proposal stronger. How quickly they could make a decision and public statement probably relates to what we ask from them. We should discuss this in #22, or, if we think that even for the minimal ask they won't be able to converge before we submit, I would punt this to after submission.

re: minimal ingredients, in my world we use a paper.md which contains markdown plus code blocks; that is the "paper". We will always be able to read that and rerun it. The tool I'd use to do so today is the jupyter infrastructure of kernels. This exists as gistexec, interactive posts, and Rmarkdown. Not sure yet about docker ... or how to replace it. However, I'm fairly confident that we will always be able to convert a Dockerfile to the-new-big-thing. And as lots and lots of people with a lot of cash use it, chances are someone will create the tool for us.

Having the flexibility of a paper.md or a paper.ipynb as the executable paper should make it easy to get the R crowd on our side. Paging Dr. @rgbkrk: have you tried feeding Rmarkdown to gistexec? Creating Rmarkdown from RStudio seems quite straightforward (never tried it, but I've watched people do it). Is this something people do? One exercise left for the reader would be to work out how to make RStudio run stuff in a docker container.

@betatim
Member

betatim commented Feb 23, 2016

What is a paper pipeline for you @ctb? For me: a git repository that contains all the code required to produce a paper, as well as a Dockerfile to create the environment in which it runs. It contains a script or text file describing what commands to type in what order (ideally it would be a Makefile).

The workflow then goes something like this:

  1. Tim develops cool new colour scheme for plots
  2. (re)runs locally to check it works
  3. looks at locally made plots/figures/tables
  4. git commit
  5. create PR
  6. CI runs it and says "yes works"
  7. CI uploads plots or other "build artefacts" somewhere for later use/inspection

To share the latest PDF of the paper we point people at www.build-artefacts.com/betatim/icecream-prefs/latest where they get the latest output of the CI run.
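The CI side of that workflow could be sketched in a Travis configuration along these lines (a sketch only; the image name, make target, and artifact-upload script are placeholders, not an agreed convention):

```yaml
# .travis.yml -- hypothetical CI config for an executable-paper repository
language: generic
services:
  - docker
script:
  # build the paper's environment and run the full pipeline inside it
  - docker build -t paper-env .
  - docker run paper-env make paper
deploy:
  # placeholder: push figures/PDF somewhere for later inspection
  provider: script
  script: ./upload-artefacts.sh
```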

@rgbkrk

rgbkrk commented Feb 23, 2016

Gist exec handles R Markdown in the most basic of ways.

@ctb
Member Author

ctb commented Feb 23, 2016 via email

@betatim
Member

betatim commented Feb 23, 2016

Crossing one thing off my list as done:

  • feeding Rmarkdown to jupyter R kernels

@ctb
Member Author

ctb commented Feb 23, 2016

Follow-on to the previous comment - this specfile could then be used when composing workflows.

I think the idea of (specfile + demo implementation + exploring composition) could be a nice circumscribed proposal to the open science prize. Thoughts?

@betatim
Member

betatim commented Feb 23, 2016

That gets quite close to http://bioboxes.org/, no?

Some thoughts on this in #16

@ctb
Member Author

ctb commented Feb 23, 2016 via email

@tritemio

Regarding composability of notebooks: it can be done with a "master" notebook (the main narrative) calling other notebooks, optionally passing parameters. There is a tiny function I wrote for this purpose:

https://github.com/tritemio/nbrun

and a more advanced implementation from @takluyver:

https://github.com/takluyver/nbparameterise

So the paper.ipynb can optionally be a master notebook executing other notebooks for the various macro-steps of the analysis. This more or less solves the dependency problem regarding the notebooks.

For software dependencies, I think specifications of "conda environments" (including the version of each package) can help rebuild the "software environment" in the years to come (assuming Continuum does not delete old packages from their archives, which is unlikely). Conda covers python and R packages as well as other basic libraries. Also, environment specifications are purely declarative YAML files (as @ctb suggested). I think using a conda environment inside docker would be a great solution.
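Such a declarative environment spec could look like this (a sketch; the environment name and version pins are illustrative, not from the thread):

```yaml
# environment.yml -- pinned conda environment for reproducing the analysis
name: paper-env
channels:
  - defaults
  - r
dependencies:
  - python=3.5.1
  - numpy=1.10.4
  - jupyter=1.0.0
  - r-base=3.2.3
```

Recreating it later is then `conda env create -f environment.yml`, which works the same inside or outside a Docker image.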

@khinsen
Collaborator

khinsen commented Feb 23, 2016

Some thoughts concerning composability, which is actually the core issue of this project.

There are three points of view concerning composition: science, communication, and technology.

In terms of science, an executable paper is composed of ingredients such as models, methods, experimental data, fitted parameters, etc. The details very much depend on the kind of science one is doing. Reusability requires that each ingredient can be replaced by a different one easily.

In terms of communication, an executable paper is composed of new material and prior art, to which the new material refers.

In terms of technology, we have to deal with the huge mess that we have piled up over a few decades. A ready-to-execute paper is composed of an operating system, compilers, linkers, interpreters, containers, servers, databases, individual datasets, libraries, middleware, software source code, and of course explanations for human readers. Maybe I have forgotten something.

The challenge is to align these different points of view in order to get something useable. We need to compose technological artefacts in such a way that we can communicate the science in a way that is understandable and reusable. That is in my opinion the ultimate goal of this project.

Ideally, we would have a single kind of technological artefact that is inherently composable. Procedures in a programming language are such artefacts: we can make a procedure that calls a few already existing procedures. Dynamic libraries are also composable: we can make a dynamic library that calls code from a few other dynamic libraries. Binary executables are composable with more effort: we need to write glue code in order to produce a binary executable that calls other binary executables. To compose different kinds of artefacts into a whole, we have to do messy interfacing work. Most of the hard problems in computing are related to composing artefacts that were not designed for being composed: packaging, portability, deployment, dependency hell, DLL hell, software rot, and many more. Composition is the #1 source of accidental complexity.

Now let's look at the technologies mentioned here, from the point of view of composability.

As far as I know, Docker containers are not composable, though I may be wrong. It doesn't sound impossible in principle to make a container out of three existing containers, but I haven't seen it done. If containers are not composable, there can only be one container in an executable paper.

BTW, there is an alternative approach that is composable: packages as defined by Nix or Guix (two implementations of the same concept). Much more promising than containers, in my opinion. Also less popular, because less convenient for software deployment. But our problem is different from software deployment.

Notebooks are not composable. You cannot combine two notebooks into a larger notebook, nor into any other useful entity. More importantly, you cannot call code in one notebook from another notebook. That means that notebooks are not reusable either. At best, reuse means that only a small part of a big notebook must be modified in order to do a different computation.

Mybinder or Everware compose an environment implemented as a container with a collection of independent notebooks into a publishable package. That package is not composable with anything else. On the other hand, this composition aligns very well with the communication aspect: the environment contains the prior art, and the notebooks contain the new stuff. Moreover, it's acceptable that the prior art is not so explorable by the user, as it has presumably been published and explained before.

That leaves the question of how to package the "new stuff" in such a way that its individual scientific components are (1) reusable and (2) explained to human readers. Software libraries offer (1) but not (2), and are restricted to code. Notebooks offer (2) but not (1). They can contain code and small datasets. Independent datasets would be a straightforward addition, so data isn't really the problem.

Traditional literate programming, as introduced by Knuth, looks like a promising way to integrate code with a human-readable explanation of the science, in a composable way. Unfortunately, it doesn't compose with notebooks into a coherent human-readable document.

In summary, what this project really is about is to compose different technologies in such a way that they permit the construction of executable papers by composition of reusable components.

@khinsen
Collaborator

khinsen commented Feb 23, 2016

@tritemio Nbrun looks interesting. Can you compose notebooks recursively using this technique? In other words, can you treat a notebook like a procedure that can call other procedures?

@tritemio

@khinsen we were probably writing our comments at the same time. I agree with your analysis. For me conda covers most use cases. What's your take on that?

Also, simple composability of notebooks is possible with the concept of a "master" notebook and "template" notebooks (see the nbrun link) that act like functions. It is not as flexible and general as calling a real function, but for the macro-steps of the analysis, with few parameters, it works fairly well (and you have links to go back and forth between master and template notebooks if you want to dive into the details).

As an example, I recently used the following pipeline:

  1. A number of template notebooks: these accept input parameters, do the normal analysis/plots, and save the important results as CSV.
  2. A single "master" notebook executes all the template notebooks with all the necessary input parameters.
  3. A "summary" notebook loads and plots/summarizes the results.

Notebooks are inter-linked for easy navigation.
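Assuming nbrun's `run_notebook` helper works roughly as its README describes (the notebook name, parameter names, and sample list here are all hypothetical), the master notebook's driving cell might look like:

```python
# Sketch of a master-notebook cell driving template notebooks.
# `run_notebook` is nbrun's helper (assumed API); each template
# notebook is expected to save its results as CSV for the summary step.

def jobs(sample_names):
    """Build (notebook, parameters) pairs, one per input sample."""
    return [("template-analysis.ipynb", {"sample": name})
            for name in sample_names]


def run_all(sample_names):
    """Execute each template notebook with its parameters via nbrun."""
    from nbrun import run_notebook  # assumed import; see nbrun's README
    for notebook, params in jobs(sample_names):
        run_notebook(notebook, nb_kwargs=params)

# usage: run_all(["exp01", "exp02", "exp03"])
# afterwards, a "summary" notebook loads the CSV files and plots them
```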

@khinsen, to answer your last question: yes, this procedure can be repeated (a template notebook can call other notebooks, with or without parameters).

@tritemio

@khinsen In principle you can build a complex dependency "graph", but when you use multiple layers you cannot easily "see" the full dependency graph by looking only at the master notebook (just as, when you call a function, you don't know how many subfunctions are also called).

@khinsen
Collaborator

khinsen commented Feb 23, 2016

@tritemio Conda is fine for what it contains. For many Python-based projects it's probably good enough. But if you don't use Python, or if you need to compile your own extension modules, then conda starts to be as much of a problem as it is of help. In particular on MacOSX, where you need a very peculiar Apple SDK installation if you want to link to libraries supplied by conda.

Euhh... I just noticed that you wrote "conda" and not "anaconda". Conda on its own is just a build and deployment tool. I wouldn't want to package all my software from scratch using conda!

@betatim
Member

betatim commented Feb 23, 2016

On Tue, Feb 23, 2016 at 7:54 PM Konrad Hinsen notifications@github.com wrote:

> As far as I know, Docker containers are not composable, though I may be wrong. It doesn't sound impossible in principle to make a container out of three existing containers, but I haven't seen it done. If containers are not composable, there can only be one container in an executable paper.

I think the best you can do is mount a container inside another. For this to work, the containers probably would have to have been designed to be used together like this. Then there is bioboxes, where you treat each container as a black box. I am not sure I like this approach.

> BTW, there is an alternative approach that is composable: packages as defined by Nix http://nixos.org/ or Guix http://www.gnu.org/software/guix/ (two implementations of the same concept). Much more promising than containers, in my opinion. Also less popular, because less convenient for software deployment. But our problem is different from software deployment.

I am not so worried about the fact that I can not automatically merge the
environments of two separate executable papers. I would posit that a
successful automatic merge is only possible in a small fraction of cases.
In the majority you would need a human to decide how to resolve conflicting
versions of the same package or their dependencies.

So the fact that you have to have a human read each Dockerfile, think about
it and create (by hand) a third one that is the merger is not a big
practical downside.

> Notebooks are not composable. You cannot combine two notebooks into a larger notebook, nor into any other useful entity. More importantly, you cannot call code in one notebook from another notebook. That means that notebooks are not reusable either. At best, reuse means that only a small part of a big notebook must be modified in order to do a different computation.

In addition to nbrun, there is work going on by guys from IBM: https://github.com/jupyter-incubator/contentmanagement

> Mybinder or Everware compose an environment implemented as a container with a collection of independent notebooks into a publishable package. That package is not composable with anything else. On the other hand, this composition aligns very well with the communication aspect: the environment contains the prior art, and the notebooks contain the new stuff. Moreover, it's acceptable that the prior art is not so explorable by the user, as it has presumably been published and explained before.

It is somewhat composable: you can treat it as a black box which has zero inputs and produces some output. Which is a little better than nothing at all, but not much.

> That leaves the question of how to package the "new stuff" in such a way that its individual scientific components are (1) reusable and (2) explained to human readers. Software libraries offer (1) but not (2), and are restricted to code. Notebooks offer (2) but not (1). They can contain code and small datasets. Independent datasets would be a straightforward addition, so data isn't really the problem.

I disagree: libraries are (1) and (2), if the maintainers bother to write the documentation. (I use "library" to mean a contained bit of code that lots of people use, like scikit-learn, glibc, ROOT, ... and not a shared library.so; no one should use those unless they have the source.)

(Below I use notebook as a placeholder for any narrative+code document: ipynb, Rmarkdown, ...)

In my experience, only those with a wish for insanity create notebooks longer than a few hundred lines of code. It quickly becomes unwieldy, and the builtin editor is not up to scratch compared to emacs/vim/atom. What ends up happening is that people explore ideas using a notebook and then create a plain .py or .R file which contains the end result of the exploration. Over time all the code forms a library for this paper. There are often only one or two notebooks that then use this code. The library contains all the plumbing and the notebook drives it. It connects the high-level commands with narrative and displays the results of the research. Maybe it does some small calculations right then and there.

So I think of the paper.{md,ipynb} as the cockpit from which you control the analysis, with shiny instruments informing you about the state of the plane. If you want to know how the fuel gauge works, you take off the panel and follow the cables down. Just like you follow a function call or shell script invocation to find out what it really does.

I think finding the right balance for a complicated problem like this will
only be possible by proposing a solution, building it, using it, finding
out why it sucks, and starting a new one. Then iterating a few times. In
the spirit of "most code is written so it can be deleted" ;)

I can think of several use cases from the LHC which won't work with what is proposed here. For example, you could ask: is a paper truly reusable if I don't also provide you with the several CPU years it takes to actually run it from start to finish? What if the data is so large that it can only be accessed from machines (close to the data) to which only CERN users have access? For a first attempt at building something like this, we should not get distracted by the reasons why it will never work, but focus on the reasons why it will work.

@khinsen
Collaborator

khinsen commented Feb 24, 2016

@tritemio I am having second thoughts about Nbrun. You use the terms "template" and "macro", so I wonder if nbrun runs sub-notebooks in a separate namespace. If not, then that's not proper composition because there is no well-defined interface between the components. A dangerous source of bugs.

@betatim I fully agree that only experience will tell what works and what doesn't. But it does help to do some brainstorming about possible difficulties in advance.

The only point on which I disagree with what you say is that a documented library is good enough as an explanation of a new model or method in an executable paper. Library documentation is reference style, organized around the code. It explains how the code does something, but it doesn't explain the motivations for doing things, nor the concepts required for understanding new science. You could of course add such material to library documentation, but that's not where it belongs. It belongs in a narrative specifically written for explaining things. That was Knuth's idea with literate programming.

A traditional paper has a section "materials and methods" and a section "results". They belong together and reference each other. It's no good to have "materials and methods" in library documentation and "results" in notebooks. That's a bit like a traditional paper saying that "a description of the methods is available from the authors upon request". A barrier between methods and results that prevents understanding.

@betatim
Member

betatim commented Feb 24, 2016

On Wed, Feb 24, 2016 at 9:32 AM Konrad Hinsen notifications@github.com wrote:

> @betatim I fully agree that only experience will tell what works and what doesn't. But it does help to do some brainstorming about possible difficulties in advance.

Many 👍 on this point. You need feedback loops everywhere, and I think we are doing an OK job here attracting brains to give feedback and then discuss! I felt like pointing out that we should not fall into the trap of "oh, this will never work", because I see so many good ideas derailed by that. It's just too easy to think of reasons why something won't work ;)

> The only point on which I disagree with what you say is that a documented library is good enough as an explanation of a new model or method in an executable paper. Library documentation is reference style, organized around the code. It explains how the code does something, but it doesn't explain the motivations for doing things, nor the concepts required for understanding new science. You could of course add such material to a library documentation, but that's not where it belongs. It belongs into a narrative specifically written for explaining things. That was Knuth's idea with literate programming.

I need to refresh my memory of Knuth's literate programming a bit. Right now I am undecided on whether having docs/moduleA.md contain the narrative documentation for code/moduleA.py is good or bad, or if it would be better to have literate/moduleA.lit from which we generate both the code and the docs. IMHO the big challenge is to get authors to write any kind of narrative docs. Maybe my expectations are too low, so that I am happy with any form of narrative docs. (The API docs belong in the code, and then we generate a nice HTML/PDF/... from them.)

> A traditional paper has a section "materials and methods" and a section "results". They belong together and reference each other. It's no good to have "materials and methods" in library documentation and "results" in notebooks. That's a bit like a traditional paper saying that "a description of the methods is available from the authors upon request". A barrier between methods and results that prevents understanding.

Yes. Though we could envision having a paper that says "a description of the methods/code is available in this hyperlink:sub-document", which is delivered with the paper because it is in the same repository. Just a different document.

@khinsen
Collaborator

khinsen commented Feb 24, 2016

@betatim The question of how to divide the information into files should probably be left to experimentation, and even remain flexible in the long run, to accommodate a maximum of tools and habits. There are various literate programming tools out there, but there are also people who prefer code and comments in separate files. What matters to me is that our tools should not discourage us from writing good explanations - as you say, the hard part is convincing people to actually do it.

@tritemio

On Wed, Feb 24, 2016 at 12:32 AM, Konrad Hinsen notifications@github.com wrote:

> @tritemio I am having second thoughts about Nbrun. You use the terms "template" and "macro", so I wonder if nbrun runs sub-notebooks in a separate namespace. If not, then that's not proper composition because there is no well-defined interface between the components. A dangerous source of bugs.

The sub-notebook is always executed by a new ipython kernel, so it's a different process; there is no namespace sharing. You can pass arguments that are serializable, and results are written down in output files. What is not formally defined is the "notebook signature": you need to open the sub-notebook to learn which arguments you can pass. There is no introspection, and there is no error checking that you are passing arguments with the right names. These checks are implemented by @takluyver's nbparameterise, so it is technically possible. Similarly, you have to look at the notebook to learn what results it saves.

Notebooks will never be (at least not easily) as composable as functions. But as an outer layer of composition of macro-steps (and by macro I mean "big", high-level) they will work well IMHO.

We should use/promote the right abstractions, and at this point I would not encourage this type of notebook composition beyond 1 or 2 layers of notebook calls (i.e. a notebook which calls a notebook which calls a notebook).

@betatim
Member

betatim commented Feb 24, 2016

http://cdn.emgn.com/wp-content/uploads/2015/07/Inception-Facts-EMGN3.gif

notebooks in notebooks in notebooks in notebooks

@m3gan0

m3gan0 commented Feb 24, 2016

Digital Science is another group you could try reaching out to. They run Figshare, Overleaf, and LabGuru - all devoted to opening up science workflows and outputs in various forms.

@ctb
Member Author

ctb commented Feb 27, 2016

@m3gan0 good idea!

@betatim
Member

betatim commented Feb 27, 2016

Do we know anyone there? If yes @m3gan0 could you post in #22 ?

@ctb
Member Author

ctb commented Feb 27, 2016

I know people there, but I don't think we should ask them for an expression of interest at this point - just mention them as part of the ecosystem we hope to work with. protocols.io is another one (we could probably get an expression of interest from Lenny Teytelman quite quickly, actually).
