A new notebook document format for improved workflow integration #4

Closed
wants to merge 11 commits into from

@khinsen khinsen commented Sep 14, 2015

For the background, see this blog post.

@Carreau
Member

Carreau commented Sep 14, 2015

Hey Konrad.

One of the things I would like to distinguish more clearly in the Jupyter notebook is the in-memory format versus the on-disk format. There are certainly things you can keep in memory that give you more information, such as which cells have not yet been run, or whether the kernel has restarted and cells are out of sync with it, that (obviously) do not belong on disk.

I'll read your proposal with more attention later.

Thanks !


## Problem

Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Version control systems require a clear separation of human-edited content and computed content. The current notebook file format mixes them. Workflow managers and provenance trackers require that all computations be replicable. For interactive computations, replicability requires storing a full log of user actions. The current notebook file format does not preserve this information, although it is available at execution time.
Contributor

I am not sure I agree with this statement about the separation of human and computed content in VCS. Also, I think your working definition of replicability is subtle enough that many folks in the community will disagree with your statement about it requiring a full log of user actions. More background on your definitions would be helpful. To make it clearer: we regularly speak of the notebook as offering reproducibility for computations.

Author

For my definitions of replicability and reproducibility see my blog post. This specific use of the terms is quite common by now, but not yet universal. In short, replication refers to repeating a calculation identically for verification, whereas reproduction is about re-doing a computational experiment using different tools. Replication is a purely technical step that requires no understanding of the scientific content, whereas reproduction implies understanding a method and implementing it differently.

Member

I agree with @khinsen on "replicable". I try to use "replicable" for notebooks, even if the habit of saying "reproducible" is hard to get rid of. Nothing prevents us from linking to content that describes the difference between replicable and reproducible more precisely. Also, the people who will read this document are most likely well aware of the difference.

@ellisonbg
Contributor

Some general comments...

I think some of the ideas you have here are very interesting. The main point for me is that it would be useful to have a full record of code blocks that a kernel runs and a clear link between those code blocks+output and the ones that appear in a notebook. That idea is worth thinking about and is mostly independent of the broader version control issues.

At the same time, given the large number of users we currently have (and their millions of notebooks), there is no way we can completely break the existing notebook format. I am not at all convinced that breaking the existing notebook format is required to address the main point above. It would not be difficult to write a kernel session monitor that records the full record of the cells and their output in a way that is linkable to the same cells in a current-format notebook. With a small number of changes to the notebook format (hashes of code cells and/or cell UUIDs), the relationship between the kernel record and the notebook document could be strengthened even further.
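A minimal sketch of the session-monitor idea, under stated assumptions: `SessionLog`, `source_hash`, and the record layout are hypothetical names for illustration, not part of Jupyter or its kernel protocol. Every executed code block is appended to a log together with a SHA-1 of its source, so that a notebook cell can later be matched to all the executions that share its code.

```python
import hashlib


def source_hash(source: str) -> str:
    """SHA-1 hex digest of a cell's source text."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


class SessionLog:
    """Hypothetical append-only record of everything a kernel executed."""

    def __init__(self):
        self.entries = []

    def record(self, source: str, output: str) -> None:
        # Each entry keeps its execution order, the code, and its output.
        self.entries.append({
            "seq": len(self.entries) + 1,
            "sha1": source_hash(source),
            "source": source,
            "output": output,
        })

    def runs_of(self, cell_source: str) -> list:
        """Return every logged execution whose code matches the cell."""
        digest = source_hash(cell_source)
        return [e for e in self.entries if e["sha1"] == digest]


log = SessionLog()
log.record("x = 2", "")
log.record("x * 21", "42")
log.record("x = 3", "")      # the user edited and re-ran an earlier cell
log.record("x * 21", "63")

# A notebook shows only the latest output of a cell; the log keeps both runs.
for entry in log.runs_of("x * 21"):
    print(entry["seq"], entry["output"])
```

The notebook file stays untouched in this sketch; only the hash is needed to link a cell to its history.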

If you can come up with concrete proposals that address the questions here without requiring any changes to the notebook format, there is a chance that the community could become interested. Most importantly, in order to justify even small breakages to the notebook format, we would need to see that prototypes of these ideas, built on the existing notebook format, were actually solving users' problems in significant ways.

@khinsen
Author

khinsen commented Oct 19, 2015

Sorry, no, I cannot do that. I am not sufficiently familiar with the internals of Jupyter to make such a proposal. The notebook format definition is not sufficient, as it doesn't specify what is and isn't a correct notebook file. For example, if I add a field to the "code cell" structure, is that a change to the notebook format or not?

As for solving users' problems, I am mainly interested in solving non-users' problems, i.e. the problems that prevent people like me from using Jupyter. It is unlikely that there is much demand for those in the existing community. My proposal is about extending the community.

@ellisonbg
Contributor

@khinsen some of the statements you are making just aren't true. For example, we have a JSON schema for the notebook format, and we validate notebooks against that schema. Here is that schema:

https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.schema.json

If a notebook doesn't validate against that schema, then it is not a valid notebook. If it does, it is.

@khinsen
Author

khinsen commented Oct 20, 2015

@ellisonbg Thanks for the pointer to the schema! There is no reference to this in the notebook format documentation, so it's a bit hard to find and it's not clear whether it is part of the format definition or just a useful tool.

But the main information that's missing from my point of view is a definition of notebook semantics. I have added an example to the repository which is syntactically valid but semantically invalid: the output doesn't match the source code.

My tiny example is obviously wrong, so it's not a real problem. But for more complex computations it is not obvious which relations between source code and output are supposed to hold inside a notebook file. This is a core issue for replicability. It is also an issue for version control, because merge operations can easily lead to syntactically correct but semantically invalid files.

There is no way to validate semantics with reasonable effort, so notebook files that have been tampered with (such as my example) are not easy to detect. But a good notebook format should allow detection of accidentally introduced semantic inconsistencies. This is why my proposal includes SHA-1 hashes.
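To illustrate the kind of check such hashes would allow, here is a rough sketch. It assumes a hypothetical output record with a `source_sha1` field; neither that field nor the helper names come from the proposal or from the current notebook format. The output remembers the hash of the code that produced it, so an output that no longer belongs to its source can be flagged after a merge.

```python
import hashlib


def sha1_of(text: str) -> str:
    """SHA-1 hex digest of a piece of source code."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_output(source: str, result: str) -> dict:
    """Store a result together with the hash of the code that produced it."""
    return {"source_sha1": sha1_of(source), "result": result}


def is_stale(source: str, output: dict) -> bool:
    """An output is stale when its recorded hash no longer matches the source."""
    return output["source_sha1"] != sha1_of(source)


source = "sum(range(10))"
output = make_output(source, "45")
print(is_stale(source, output))          # False

# A merge brings in an edited source but keeps the old output.
merged_source = "sum(range(100))"
print(is_stale(merged_source, output))   # True
```

This only detects that source and output have drifted apart; it says nothing about whether the recorded result was correct in the first place.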

Could such hashes be added to the current notebook format? Syntactically, this looks difficult: if I understand the schema correctly, there is no room for adding fields. Perhaps one could figure out a way to squeeze this information into existing fields somehow. But the first question is: does the notebook format make any promises about consistency at all?

@Carreau
Member

Carreau commented Oct 20, 2015

There is no reference to this in the notebook format documentation, so it's a bit hard to find and it's not clear whether it is part of the format definition or just a useful tool.

Good point, we can try to fix that.

As for CRCs and cryptographic checksums that ensure consistency, I (personally) think it will be a tough sell to make them mandatory, and tools would have to implement them correctly to guarantee consistency. A tool can perfectly well save a 3+1 = 7 notebook with valid hashes.

We had a discussion about marking "dirty" cells in the UI, which turned out to be more complicated than we thought. One of the problems with the way the notebook currently works is that the kernel can get disconnected, so some decisions about what to persist, and where, are a bit weird.
In particular, there is an in-memory versus an on-disk format. You could have an in-memory format that is not yet consistent (waiting for a kernel reply, containing the ID of a future reply), while the on-disk format has to be consistent. This is not something we do currently.

Could such hashes be added to the current notebook format?

Yes,

if I understand the schema correctly, there is no room for adding fields

No, the current schema does support adding keys. In general, the metadata: {} fields are arbitrary and up for interpretation by implementations.

Extra fields in other places leave the notebook valid, although the cell may become unrecognized; such a notebook is technically valid, but implementations are allowed to ignore these fields.

This would allow us to make a minor revision by adding fields, which would remain backward compatible. However, before committing to, for example, a sha1 key at the top level, nothing prevents us or anyone else from experimenting with metadata.sha1 = <sha1>; other implementations would simply ignore it.

Jhamrick had a prototype of this for grading notebooks with nbgrader, in order to check that the test-case cells were not tampered with by students (in the end the hash was moved to SQLite for other reasons), but the metadata does contain other nbgrader-specific information.

But the first question is: does the notebook format make any promises about consistency at all?

In the format itself, no. There used to be an optional signature to make sure the notebook was actually generated by the current machine (for security).
This was moved to an SQLite database (Library/Jupyter/nbsignatures.db on OS X), so hashing and having (some) guarantees of consistency is possible, but likely a hard problem.
In particular, I am concerned that if the requirements for creating a valid notebook are too high, people will just not use them.

Does that make sense and answer some of your questions?

I can try to see if I can come up with an nbconvert plugin that hashes all cells, stores the hashes, and allows you to check them. Would that help?
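As a rough sketch of what such a hash-and-check tool might do (plain dictionaries only, not an actual nbconvert plugin): the helper names `cell_digest`, `stamp`, and `changed_cells` are hypothetical, the `metadata.sha1` key is the experimental key floated above rather than anything the format defines, and outputs are simplified to plain strings.

```python
import hashlib
import json


def cell_digest(cell: dict) -> str:
    """SHA-1 over the cell's source and outputs, serialized deterministically."""
    payload = json.dumps(
        {"source": cell["source"], "outputs": cell.get("outputs", [])},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def stamp(notebook: dict) -> None:
    """Write each cell's digest into its free-form metadata."""
    for cell in notebook["cells"]:
        cell.setdefault("metadata", {})["sha1"] = cell_digest(cell)


def changed_cells(notebook: dict) -> list:
    """Indices of cells whose content no longer matches the stored digest."""
    return [
        i for i, cell in enumerate(notebook["cells"])
        if cell.get("metadata", {}).get("sha1") != cell_digest(cell)
    ]


# Simplified notebook structure; real outputs are richer objects.
nb = {"cells": [
    {"cell_type": "code", "source": "3 + 1", "outputs": ["7"], "metadata": {}},
    {"cell_type": "code", "source": "2 * 2", "outputs": ["4"], "metadata": {}},
]}

stamp(nb)
print(changed_cells(nb))   # [] even though "3 + 1" -> "7" is wrong

nb["cells"][1]["source"] = "2 * 3"
print(changed_cells(nb))   # [1]
```

The first check passes despite the wrong output in cell 0, which is the "3+1 = 7 with valid hashes" caveat: the hash detects edits made after stamping, not incorrect computations.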

@khinsen
Author

khinsen commented Oct 21, 2015

Making hashes optional sounds fine, as long as it is straightforward for users to produce notebooks that do contain them. Any tool attempting validation would flag a hash-less notebook as "dubious".

The point of hashes is not to prevent buggy software from producing wrong notebooks; there is no way to prevent that in general. The point is to allow merging of independent changes to a notebook and recognize output data that has become invalidated in the course of the merge. However, I am not convinced that the addition of hashes is of much interest in itself. To make notebooks good citizens of version controlled repositories, I think it is also necessary to separate human input from computational output as I explain in my proposal. The reason is that merging differences in the computational output will most likely lead to a complete mess, including syntactically wrong MIME data and other unpleasant things.

I looked at the discussion about "dirty" cells and it seems to me that the difficulties with that idea are ultimately the same as the problems I am trying to solve with this proposal: the current notebook data model has no clear notion of dependencies between its data items. My "stale output" cell type addresses the same issue as those "dirty" cells but does so on the basis of real computational dependency information.

I don't quite understand the issue of making the requirements for creating a valid notebook too difficult. None of what I propose requires any user intervention. It's the Jupyter notebook tool that should do all the work behind the scenes.

@rgbkrk
Member

rgbkrk commented Oct 21, 2015

You can't verify the accuracy of all computations with hashes alone. You can't even fully verify them with certifying algorithms. Trivial ones certainly, but you're still at the mercy of the operating environment (versions of software, hardware, etc.). That's not to say that it shouldn't be done or isn't a plausible goal, just that it is a far larger scope than can be dictated in this proposal.

@minrk
Member

minrk commented Oct 21, 2015

If the primary goal is separating input from output for version control, this can be done relatively simply, and there are a variety of ways to go about it (ipymd does it, nbexplode does it, etc.). Hashes are one possible implementation detail for locating output with its matching input, and since those hashes would reside exclusively in the not-always-tracked output file / directory / database / whatever, they wouldn't be polluting anything. We've talked about the 'output sidecar' file before, and could consider adopting one such implementation as an optional, official way to split the notebook storage.
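A small sketch of the "output sidecar" idea, with assumptions stated up front: `split` and `join` are hypothetical helpers, not how ipymd or nbexplode actually work, and the notebook is a simplified dictionary in which each cell's `source` is a single string and outputs are plain strings. Outputs are moved into a separate mapping keyed by the SHA-1 of the source that produced them.

```python
import copy
import hashlib


def sha1_of(text: str) -> str:
    """SHA-1 hex digest of a cell's source text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def split(notebook: dict) -> tuple:
    """Return (inputs_only_notebook, sidecar); sidecar maps source hash -> outputs."""
    inputs = copy.deepcopy(notebook)
    sidecar = {}
    for cell in inputs["cells"]:
        if cell["cell_type"] == "code":
            sidecar[sha1_of(cell["source"])] = cell["outputs"]
            cell["outputs"] = []
    return inputs, sidecar


def join(inputs: dict, sidecar: dict) -> dict:
    """Reattach outputs whose source hash still matches; others stay empty."""
    notebook = copy.deepcopy(inputs)
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = sidecar.get(sha1_of(cell["source"]), [])
    return notebook


nb = {"cells": [
    {"cell_type": "markdown", "source": "# Demo", "metadata": {}},
    {"cell_type": "code", "source": "1 + 1", "outputs": ["2"], "metadata": {}},
]}

inputs, sidecar = split(nb)
print(inputs["cells"][1]["outputs"])                   # []
print(join(inputs, sidecar) == nb)                     # True

# Editing the tracked input orphans the old output instead of mislabeling it.
inputs["cells"][1]["source"] = "1 + 2"
print(join(inputs, sidecar)["cells"][1]["outputs"])    # []
```

Only the inputs file would need to be tracked in version control. One limitation of keying by source hash alone is that two cells with identical code share a sidecar entry, so a real implementation would likely need cell IDs or execution order as well.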

@Carreau
Member

Carreau commented Oct 21, 2015

I don't quite understand the issue of making the requirements for creating a valid notebook too difficult. None of what I propose requires any user intervention. It's the Jupyter notebook tool that should do all the work behind the scenes.

Making a field optional while its semantics are hard to get right is a recipe for something that goes unused or is badly used. We can do it right in the notebook, but people rely on the notebook format being simple enough to generate their own.

I don't want to end up with something like Windows Vista UAC, where everybody clicks without reading.

@khinsen
Author

khinsen commented Oct 21, 2015

@Carreau Which programs other than Jupyter actually create notebook files from scratch? I have tried to find some but so far without success.

@Carreau
Member

Carreau commented Oct 21, 2015

PyCharm, off the top of my head.

@Carreau
Member

Carreau commented Oct 21, 2015

Sphinx Gallery, from Gael Varoquaux, wants to auto-generate notebooks from Sphinx docs, so that you can write docs as rst and offer a "download as notebook" link to users. It is in progress, maybe not finished yet.

ipymd has to generate at least an in-memory one; runipy likely does too, as they have templated variables.

I don't know how much they rely on nbformat to do so though.

@khinsen
Author

khinsen commented Oct 26, 2015

I saw a presentation this morning at the Saclay Open Software Day on Sphinx Gallery and also another project that generates notebooks as a documentation of a computation. I think they actually illustrate the problem I am trying to solve, because they use notebooks not as a storage and exchange format, but for output only - it's strictly one-way. A bit like generating PDF, with some obvious added value. The goal of my proposal is that such tools could read and write notebooks.

@Carreau
Member

Carreau commented Oct 26, 2015

Do you know if these presentations were recorded? I saw Gael give a 5-minute lightning talk on Sphinx Gallery, but would like to know more.

I'm not sure why Sphinx Gallery couldn't read notebooks; IIRC Gael was complaining about manual editing, not the format.

Also, @fperez is likely to be around Saclay these days; you might be able to have a back-and-forth with him in person, which might be much more productive than discussing by mail.

@khinsen
Author

khinsen commented Oct 26, 2015

There's a camera next to me, so I suppose the sessions were recorded. I'll post a link when I know more. And yes, @fperez is here as well, he gave the opening keynote.

@Carreau
Member

Carreau commented Oct 26, 2015

OK, great! Say hi! (And I'm looking forward to the video.)

@khinsen
Author

khinsen commented Oct 29, 2015

@Carreau The videos are up! Unfortunately I didn't find an occasion to talk to @fperez about anything technical such as this issue.

@Carreau
Member

Carreau commented Oct 29, 2015

@Carreau The videos are up! Unfortunately I didn't find an occasion to talk to @fperez about anything technical such as this issue.

Thanks for the heads-up. I'll try to find some time to watch; it might help me understand!

@timoc

timoc commented Mar 18, 2016

@khinsen have you seen org-markup?
It is a plain-text markup language like Markdown, but better. It would seem to be the natural format for a Jupyter notebook, as embedding data and executable code is a native feature. It can be used for literate programming and to create reproducible research - see [http://orgmode.org/worg/org-papers.html] and [http://orgmode.org/worg/org-contrib/babel/uses.html] or [http://orgmode.org/worg/org-contrib/babel/examples/data-collection-analysis.html].

Bonus: It's a native GitHub format too [https://github.com/fniessen/refcard-org-mode/blob/master/README.org].

@ellisonbg org-markup (specifically org-babel) already has mechanisms to separate the source from the result of any given embedded calculation. Even better, it has tagging support. Tagging means you can tag parts of the document to assign a completeness status (e.g. TODO) or to control what is executed at publishing time (e.g. noexport). I use this feature myself as part of my test-driven document development process and my literate programming development process.

In addition:

  • There are org-markup parser libraries available in many languages [http://orgmode.org/worg/org-tools/index.html], including Python [https://github.com/bjonnh/PyOrgMode], though some may not support all of the babel features.
  • Org-markup is plain text and so can be better managed with source code control. It is a 'source document' that can also store arbitrary data in tables and calculations, SQL queries, etc. Diffing and merging are easier due to the plain-text nature, but there is also a git merge tool somewhere too.
  • Org-markup can be used to create publishable documents in many formats. For example, I am using org source files and pandoc to create MS Word and PDF documents.
  • Org-markup also supports blogging and project management, among other things.
  • Update:
    I should also mention it can embed UML diagrams (plantuml), graphs (gnuplot), images, and many other media sources.

@khinsen
Author

khinsen commented Mar 18, 2016

@timoc Yes, I know org-markup, I use it all the time for lots of things. And yes, it is one step up from Jupyter's format in terms of managing the ingredients of a notebook. But it doesn't keep a trace of the computation either, so in my view it is not sufficient.

@timoc

timoc commented Mar 18, 2016

@khinsen, maybe I'm missing the point of this feature request. I am completely new to Jupyter, and I come from an Emacs background using org mode. I posted to this feature request explicitly because I saw the overlap.

If I understand this feature request at all, it's more from the comments than from the premise, but if I understand the premise of your original request, it is to separate these concerns. I agree.

The concerns are those of the (org/Jupyter) document as a source artefact, those of the 'computation' as one or more compilation artefact(s), and those of the result, which is the final set of result artefact(s) based on the 'compilation' artefacts. Even in a distributed computation environment, this would seem to be the case. This seems to be the same process you find in any sufficiently mature continuous build, test, and delivery infrastructure, if you separate the concerns as you outline.

I think org-markup is the right choice for the source document format, because with tags you can encode, in the org document itself, the code that tests and validates the outcome. I would suggest a pre, post, and final tag set, so that computational code fragments can be used to validate (possibly with a hash?) the computation and result artefacts as part of a traditional build approach.

I have yet to look at any of the videos, so maybe I am being naive about the challenges you face that org does not address.

In case my assumptions are not correct, can you suggest an English presentation that will give better context on this problem?

@meeseeksmachine

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/38

@meeseeksmachine

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/41

@meeseeksmachine

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/jupyter-and-github-alternative-file-formant/4972/51

@Zsailer
Member

Zsailer commented Mar 4, 2024

Hi @khinsen, this is Zach from the @jupyter/software-steering-council.

We're working through old JEPs and closing proposals that are no longer active or may not be relevant anymore. Under Jupyter's new governance model, we have an active Software Steering Council that reviews JEPs weekly. We are catching up on the backlog now. Since there has been no active discussion on this JEP in a while, I'd propose we close it here (we'll leave it open for two more weeks in case you'd like to revive the conversation). If you would like to re-open the discussion after we close it, you are welcome to do that too.

I'd also like to mention that this proposal could be replaced by #103, which proposes a Markdown-based notebook format, in case you are interested in joining that conversation.
