Proposal: 4.5 Format Cell ID #61
Comments
I'm +1 on the idea in general. A couple quick thoughts:
some questions
|
I think this was suggested some time ago. I believe it makes some implementations really complicated, typically how to handle:
I think this has a strong potential to make notebook format implementations wrong, at least if each cell has an id and the id must be unique. Depending on how the ID is generated, it also means that notebooks will have randomly variable fields, and the order of operations in which you create a notebook changes its final (on-disk) state (bad for reproducibility). To ensure uniqueness it would be better to change the notebook format into a (list of ids) and a (mapping of id to cells). Though that's profoundly different, and cells need to know their id, which is not that good. {And then you can change the "list" to any other DAG structure if you wish, but let's forget those two paragraphs for now.} I will not oppose such a change, but I think guaranteeing uniqueness and auto-generation will be quite tough, and has the potential to not be followed. |
The metadata object has I do not believe that a formal UUID is necessary because:
Because of the de-duplication I would even say that a counter would be more succinct and sufficient. Implementation:
One of the cells gets a new ID.
On paste give the pasted cell a different ID if there's already one with the same ID as being pasted.
See above.
See above. Additionally on notebook load, if an ID is duplicated then give subsequent cells new IDs and consider this a user edit operation. |
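The load-time rule above (keep the first occurrence of an ID, give later duplicates a new one) can be sketched roughly as follows. This is only an illustration under assumptions: cells are plain dicts, the key name "id" is hypothetical at this point in the discussion, and the counter-based "idN" naming is one possible scheme rather than anything Colab or nbformat is known to do.

```python
def repair_duplicate_ids(cells):
    """Give later cells a fresh id when an earlier cell already uses theirs.

    `cells` is a list of dicts; the "id" key is an assumption of this sketch.
    Returns the number of cells that were re-assigned.
    """
    seen = set()
    # Reserve every id present anywhere in the notebook, so a fresh id never
    # collides with an id that only shows up further down.
    reserved = {cell.get("id") for cell in cells if cell.get("id")}
    counter = 0
    changed = 0
    for cell in cells:
        cell_id = cell.get("id")
        if cell_id and cell_id not in seen:
            seen.add(cell_id)  # first occurrence keeps its id
            continue
        # Missing or duplicated id: pick the next free counter-based id.
        while f"id{counter}" in reserved:
            counter += 1
        new_id = f"id{counter}"
        cell["id"] = new_id
        reserved.add(new_id)
        seen.add(new_id)
        changed += 1
    return changed


cells = [{"id": "a"}, {"id": "a"}, {"id": "id0"}, {}]
changed = repair_duplicate_ids(cells)
ids = [cell["id"] for cell in cells]
```

A caller could treat a non-zero return value as a user edit, as suggested above, and mark the notebook dirty.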
I can get into deeper discussion on the actual JEP as well, but these are good questions. Let me take a stab at answering what I think should be done:
Correct. It stays the same once created.
Yes.
No. Much like copying contents out of one document into another -- you have a new cell with equivalent contents and a new id.
One cell (preferably the one with the top half of the code) keeps the id, the other gets a new id. This could be adjusted if folks want a different behavior without being a huge problem so long as we're consistent.
Correct the copied cell should have a new id -- I should have denoted that cell ids should be unique within a document and not reused.
I'd agree it should try to preserve the ID -- you're moving the cell in its entirety.
New id -- you have a duplicate of the original. If we go with an "all cell ids must be unique within a notebook" rule this would trump other behavior when in conflict. |
Thanks for opening this issue, @MSeal! I'm +1 on this proposal overall as it definitely helps with a lot of scenarios that require us to reason about cells as if they were independent entities. The UX considerations that @Carreau points out are good to identify. I think most of the work here will be in establishing the conventions (duplicated cells have unique IDs, etc.) rather than in the technical implementation. Hopefully, we can resolve these in the course of the JEP.
@blois Can you share what string value you use? Do you have universal uniqueness into it or do you expect the IDs to be local per notebook? |
@blois But additionalProperties is false for each cell type: https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.4.schema.json#L183. This is not adding to metadata but setting the ID in the cell itself.
Open to suggestions on this. UUIDs have been the de facto standard for document id fields for a while now in most settings. It simplifies the contract for specifying how one sets a missing id as well as the format of the message in a universal manner. If not UUID we'd need some regex schema we'd want to follow I imagine. I do agree the URLs could get large with a UUID pattern. |
Another thought - do we consider the notebook as a part of uniquely identifying a cell? E.g. is each cell's identity a combination of a notebook + a cell ID, or just a cell ID? I don't think there's anything like a unique ID for notebook either...not sure if that is a topic that has been discussed before. I'm not sure how relevant it is but seems like it could be useful if we're considering ways to have shorter cell IDs so they're nicer from a UX perspective (e.g. if each notebook has a unique notebook ID and cells are referred to by |
I wanted to start with just the cell id first as notebook id has more complications. I believe each cell's identity would be |
@MSeal thanks for the correction. Colab has made heavy use of cell IDs for many many years, it does seem generally useful. I want to stress that Colab generates cell IDs right now 12 characters long (but will accept any string value) and 12 chars is honestly too long given that non-unique IDs within a single notebook have to be dealt with gracefully. Colab's use of cell IDs within the URL seems like a common scenario. |
@captainsafia IDs will be unique within the notebook but Colab will automatically fix conflicts on open. There are many other tools which will generate conflicts. The cell ID is probably best considered a fragment of a URL where the notebook would constitute the rest of the path. A globally unique cell ID would be the combination of a unique notebook ID (URL) and cell ID. In the case of a github notebook this would include repo, path and revision. An example Colab notebook with some auto-generated and manually modified cell IDs is: |
Is there a good way to avoid some of the issues such as jupyter/nbformat#167? Currently in JupyterLab 2.1.2 one can open a notebook with format 4.5, add cells, then save resulting in a 4.5 notebook without cell_ids. |
Gah, I think adding top-level attributes to things that were previously Other thing is: nbformat already has
If it must be guid, 👍 to keep-it-stupid-strings regexen. Having them be well-formed is a great property, but relying on bleeding edge features for a spec seems a hard road. Also the timescale of cell generation is relatively slow vs kernel messages, where you really want to know how wide your messages on the wire are... human readable values would indeed be best, but if auto-generated: short, starting with a letter, and unlikely to generate profanity (really, this is important for a file format you expect people to be able to email) is probably better than full guid. It would be worth digging up a cross-platform approach that has some of those properties. More broadly: I would not start relying on features from
well.... to my knowledge there is still not an official WADM selector for JSON. So this isn't going to move the dial on making the format annotateable in-place... directly. However, hoisting the concern that most clients would actually start populating the id (whatever it's called) is a good start, as we could describe the projection of the id into a concrete representation that can be annotated: in most Jupyter/hypothes.is cases, that would be the DOM: e.g. As to being able to validate uniqueness: yerp, nope, can't do it with JSON Schema, aside from If nbformat (and other official jupyter ipynb implementations) was going to start enforcing
To further pursue the annotation question: at this point about the only thing that uniquely identifies a cell is a verifiable source of truth, like a git commit, ipfs id, trusted URL endpoint, etc, then the notebook path (probably), and then the cell. Because the first part of that gives you veracity... you can just annotate using the cell number, nothing special there. But useful annotation probably needs to talk about a place in a cell, e.g. "the range of characters 5-10 on line 10 of the output of cell 2". Ugh. True uniqueness is not something a file format can or should be able to enforce. |
@bollwyvl that's a great technical explanation of why this is a difficult problem. I'd like to underscore that it is still extremely useful to have:
Colab has been doing this for quite some time and I think it would benefit the broader ecosystem if it were more broadly available to tools. Colab uses it for:
I'd also like to emphasize that Colab has been getting along fine with |
For the spec, as we've done with message ids, requiring only that it be a unique string within the scope of the notebook is my preference since it allows for a variety of strategies. I would be specific about:
I wouldn't actually recommend using UUIDs in our own default implementations. Lots of large random strings in notebooks can be frustrating, and are something we've tried hard to avoid. 128-bit UUIDs are also vast overkill for the level of uniqueness we need within a notebook with <1000 candidates for collisions. They make for opaque URLs, noise in the files, etc. The shorter and more intelligible the better, especially for something that is to be used in user-visible places like links. It should be a valid strategy, when populating cell ids from a notebook on import from another id-less source or older format version, to use e.g. strings from an integer counter. In fact, if an editor app keeps track of current cell ids, the following strategy ensures uniqueness:

    cell_id_counter = 0
    existing_cell_ids = set()

    def get_cell_id(cell_id=None):
        """Return a new unique cell id

        if cell_id is given, use it if available (e.g. preserving cell id on paste, while ensuring no collisions)
        """
        global cell_id_counter
        if cell_id and cell_id not in existing_cell_ids:
            # requested cell id is available
            existing_cell_ids.add(cell_id)
            return cell_id
        # generate new unique id
        cell_id = f"id{cell_id_counter}"
        while cell_id in existing_cell_ids:
            cell_id_counter += 1
            cell_id = f"id{cell_id_counter}"
        existing_cell_ids.add(cell_id)
        cell_id_counter += 1
        return cell_id

    def free_cell_id(cell_id):
        """record that a cell id is no longer in use"""
        existing_cell_ids.remove(cell_id)

If bookkeeping of current cell ids is not desirable, a 64-bit random id (11 chars without padding in b64) has a 10^-14 chance of collisions on 1000 cells, while an 8-char b64 string (48b) is still 10^-9. |
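The collision figures quoted above can be checked with the usual birthday-bound approximation, and the string lengths with Python's standard `secrets` module. This is a rough sketch only; the helper names `collision_probability` and `random_cell_id` are made up for illustration and are not part of any proposal.

```python
import math
import secrets


def collision_probability(n_ids, bits):
    """Birthday-bound estimate of at least one collision among n_ids random ids.

    Uses p ~= 1 - exp(-n(n-1) / 2^(bits+1)); expm1 keeps precision for tiny p.
    """
    return -math.expm1(-n_ids * (n_ids - 1) / 2 ** (bits + 1))


def random_cell_id(n_bytes=8):
    """URL-safe base64 text (no padding) built from n_bytes random bytes."""
    return secrets.token_urlsafe(n_bytes)


p64 = collision_probability(1000, 64)  # order of 10^-14
p48 = collision_probability(1000, 48)  # order of 10^-9
id64 = random_cell_id(8)  # 8 bytes -> 11 characters
id48 = random_cell_id(6)  # 6 bytes -> 8 characters
```

Note that URL-safe base64 can produce ids that start with a digit, `-` or `_`, which may matter if ids are expected to start with a letter.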
I don't quite follow this line. What's the downside? Defining new properties where they weren't previously is the main point of new minor revisions. Defining them where |
@choldgraf re: ID for notebooks, there is this thread (and links in it) jupyter/nbformat#148 |
yerp, always get confused... minor vs point, and of course it's in the title of issue. Adding an additional prop required, with no grace period, still seems rough. Perhaps a more measured rollout:
I think a lot of this becomes more tenuous in relation to multi-client support, to which the nbformat contributes a non-trivial amount of headache (see, ordered list of objects). An incrementer doesn't scale for people coming in-and-out of a "swarm" of editors, and still expecting things to "work" without a lot of heuristic approaches. With some substantial implementation complexity: consider an optional, notebook-level
As to a more multi-client robust, cross-language approach: yes, an algorithm can be encoded in a few hours, and I don't know of a standard that meets all our needs 😢. But again from the widely-implemented-but-not-a-standard stable, there is nanoid, which has currently 14 language implementations (notably not julia or R). They aren't all that pretty or short, though: e.g.
yeah, if trying to do multi-level, de-referenceable identity, it would be somewhat hurtful to not look at the JSON standards officially supported by some of the tools mentioned above (e.g. WADM).
Going further, the "shape" of the document can be validated with constraints, but is not as lightweight as jsmespath/json-e. This mode would probably also required a But... Let's burn the |
So general consensus sounds like let's make the id a unique string, which could accommodate a uuid if needed but defaults to something shorter and simpler with a fixed range of characters and a min/max length. I think I can easily adopt that into the actual JEP proposal. On that front, following https://github.com/jupyter/enhancement-proposals/blob/master/jupyter-enhancement-proposal-guidelines/jupyter-enhancement-proposal-guidelines.md#phase-1-pre-proposal who should be the designated shepherd for this pre-proposal? I'll keep trying to address concerns or adjust design constraints in this thread in the meantime until we have a go/no-go about me promoting to a full JEP. |
I think we can make this a requirement of changes needed to be made to address. I don't see this being insurmountable going forward.
Uniqueness within a single file seems easily achievable. It can't be done purely with json schema without contorting the format as @Carreau noted. We can enforce this at the library / application level without a lot of cause for concern. Specifically nbformat can be used to validate uniqueness and even provide opt-in repair of uniqueness issues if a client makes a mistake. Without uniqueness you can't reliably build on top of the abstraction.
This came up years ago in the id conversations (I can look for the public threads later). It's not unique, not required, and not constrained to certain characters, which made building on top of name unreliable and inconsistent. It also puts your display of information and programmatic references at odds with each other. Time since hasn't improved this story and it's parallel to other systems that needed a split on display info and identity info.
Annotating with cell number is not sufficient. You get disassociation when the cell gets moved within a notebook without an identifier to map against. You can do this with a metadata field (like Colab and Deepnote do) but it's inconsistent across services and causes fragmentation for people to build on top of ids in a general manner. Noteable is going to be in the same boat, where we'll have to implement our own id if it's not part of the standard.
I'd much prefer to add a new required field in one go, and have the most common base libraries support roll-forward / roll-back. We have schema version for exactly this purpose and the proposed change is backwards / forward compatible here.
This can have a risk of collision or confusion if a user expects the ids to be sequential integers in the notebook permanently. Otherwise I don't see a strong reason to block a number string being used. I think having a random string of length X (6, or 8? characters) would give a better default expectation for how this field is intended to be used.
UUID strings (or binaries that can be stringified) are well supported cross-language. If we allow a format that could include those this is a universally usable pattern -- or suggest using uuidv4 and taking the last k characters. Pseudo-random strings are also pretty easy cross-language if we specify the characters allowed.
Sounds good -- thanks for the inputs. |
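As a rough illustration of the two points above (library-level uniqueness checks, and short random strings drawn from a fixed character set), here is a minimal sketch. The alphabet, the default length of 8, the letter-first rule, and the names `short_id` and `duplicate_ids` are all assumptions chosen for the example; nothing here reflects an agreed spec or the nbformat API.

```python
import secrets
import string
from collections import Counter

# Assumed character set for illustration: lowercase letters and digits.
LETTERS = string.ascii_lowercase
ALPHABET = string.ascii_lowercase + string.digits


def short_id(length=8):
    """Random id of `length` characters that always starts with a letter."""
    first = secrets.choice(LETTERS)
    rest = "".join(secrets.choice(ALPHABET) for _ in range(length - 1))
    return first + rest


def duplicate_ids(cells):
    """Return the sorted ids that occur more than once in `cells`."""
    counts = Counter(cell["id"] for cell in cells)
    return sorted(cell_id for cell_id, n in counts.items() if n > 1)


example_id = short_id()
dupes = duplicate_ids([{"id": "a"}, {"id": "b"}, {"id": "a"}])
```

A validator could report the result of `duplicate_ids` as an error, with repair left as an opt-in step.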
I am in favor of seeing this move forward. In practice, it turns out to be difficult to implement a jupyter frontend without some notion of cell ids. JupyterLab even passes its cell id as metadata to the kernel, and some kernels leverage this to do cell dependency tracking. The only question I have is if it makes sense to go further and replace the "list of cells" structure by "list of cell ids" + "map from cell ids to cells". |
I am +1 on id for cell. I have had to implement a JupyterLab extension needing ids for the cells and was sorry to not have an explicit field for that. Although uuid sounds like a good fit, a simple string would bring more flexibility. My understanding is that the spec defines the format (mandatory/optional, type), but it is up to the frontend to decide how to use that. |
Can you describe more how archival would be impacted? If they're set once and preserved as artifacts move around I don't quite follow how it'd impact persistence / recall since the id wouldn't change over time unless the application chooses to rewrite it. One of the intentions here is to help application recall as a notebook may change ordering of cells (in or out of said application) but wish to preserve association of a particular cell independent of position in a standard fashion. |
Over a really long time scale we can't rely on assumptions that things will be a certain way. A notebook artifact is going to be identified as whole object based on a SHA. Relative to the SHA there are cells in order. The SHA and cell ordering are two of the only things we can rely on for a really long period of time; and RDF/JSON-LD contexts.
The notebook currently stores application information in the metadata; if this is application-level data then it is independent of the cells. Out-of-order notebooks are dangerous; they are common practice, but they are not sustainable. Hopefully, the community can establish best practices to curb this. In the nbformat definitions, cells are ordered, so they have ids already. Using a list in a schema implies ordering, and out-of-order notebooks don't fit that convention. I am not confident that in ten years out-of-order notebooks could still work, while I have more confidence in ordered notebooks. If cells are out-of-order then maybe there is a way to use references and definitions to separate the ids from the linearity. This solution could allow for a mix of the old and the new.

    {
      "cells": [
        { "$ref": "#/cell_definitions/uuid1" },
        { "$ref": "#/cell_definitions/uuid2" },
        { "metadata": ..., "source": ..., "cell_type": ... }
      ],
      "cell_definitions": {
        "uuid1": { "metadata": ..., "source": ..., "cell_type": ... }, # this is a cell type
        "uuid2": { "metadata": ..., "source": ..., "cell_type": ... }
      }
    } |
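To make the references-and-definitions idea above concrete, a reader of such a file would need to turn the mixed list back into an ordered list of cells. The sketch below shows one possible way to do that, assuming the hypothetical "cell_definitions" key and "#/cell_definitions/<id>" reference form from the example; it does not implement general JSON Reference resolution, and `resolve_cells` is an invented name.

```python
def resolve_cells(notebook):
    """Return cells in document order, following "#/cell_definitions/<id>" refs.

    Entries without a "$ref" are treated as inline (old-style) cells.
    """
    definitions = notebook.get("cell_definitions", {})
    prefix = "#/cell_definitions/"
    resolved = []
    for entry in notebook["cells"]:
        ref = entry.get("$ref")
        if ref is None:
            resolved.append(entry)  # inline cell, kept as-is
        elif ref.startswith(prefix):
            resolved.append(definitions[ref[len(prefix):]])
        else:
            raise ValueError(f"unsupported reference: {ref}")
    return resolved


notebook = {
    "cells": [
        {"$ref": "#/cell_definitions/uuid1"},
        {"$ref": "#/cell_definitions/uuid2"},
        {"cell_type": "markdown", "source": "inline", "metadata": {}},
    ],
    "cell_definitions": {
        "uuid1": {"cell_type": "code", "source": "x = 1", "metadata": {}},
        "uuid2": {"cell_type": "code", "source": "x + 1", "metadata": {}},
    },
}
sources = [cell["source"] for cell in resolve_cells(notebook)]
```

The list still carries the ordering, so existing tools that expect ordered cells could keep working on the resolved output.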
Just a meta-point here. This conversation is fantastic with a lot of viewpoints. Is somebody willing to help serve as a shepherd to guide the conversation forward, make sure that voices are heard, summarize and synthesize, etc? I think that will make sure that we have the right amount of information to move forward.* Another question: which group of folks owns the decision on this one? I suppose it would be core maintainers on the nbformat repository since that's the reference implementation of the notebook spec? *I'd do it but I am expecting a baby in -1 days :-) |
Chris you bring up a great point about who owns this decision. Under our
current governance model, I think the final decision on any JEP needs to be
the current Steering Council. I realize this isn't ideal - the new
governance model we are actively working on will change the decision making
body and allow for more flexibility.
--
Brian E. Granger
Principal Technical Program Manager, AWS AI Platform (brgrange@amazon.com)
On Leave - Professor of Physics and Data Science, Cal Poly
@ellisonbg on GitHub
|
I think that there's enough interest to begin iterating on a JEP and collaborating on the best technical approach. I suspect that the governance being in transition is secondary to the content of the JEP and whichever group is deciding at the time the JEP is done (Steering Council or TBD) can respond to the JEP. I'm willing to meet with interested folks on alternate weeks from the RTC meeting run by Saul. The preliminary JEP meeting for cell id/information could be Monday August 17th 9:30am - 10am Pacific. Here's a HackMD to get folks started: https://hackmd.io/@Y6xjRiXFRUmwV7lDeM-5nQ/rJ-VFemfv/edit |
I'm suggesting a meeting as a better way to share perspectives and collaborate than doing this all via text on an issue. |
Thanks Carol, I think that is a good next step. I likely won't be able to
attend, but will continue to watch the effort here. I think this is
important work and am in favor of seeing it move forward. I also think that
enough groups have already run into this issue that we should stick to the
pattern as those groups have run into it and not increase the scope of the
proposal to cover other usage cases. In either case, the JEP itself should
be explicit about the intended scope of the change, and its usage cases,
and other options and usage cases that were considered (and their pros and
cons).
|
Thanks @ellisonbg. I completely agree re: scope. |
I've attached a Zoom call in the meeting Carol started. Planning for 30 minutes on Monday it should be open to anyone to join (with the join password in the doc). |
I will not be able to join any of these meetings due to pending 👶 but I am supportive of the idea and those who are interested in pushing this forward, and confident that we can come to a decision in an inclusive and productive manner ✨ |
Best of luck @choldgraf ! ... DM me with what my future will be in 2 months! I'll take silence as a sign of much restful sleep 😉 |
Thanks folks that attended! We're planning to repeat the meeting in 2 weeks at the same time and get a draft of the actual JEP with all the feedback so far included as prep for that session. Notes are captured in https://hackmd.io/AkuHK5lPQ5-0BBTF8-SPzQ (I'll need to change up the call setup for next time as there were technical difficulties with the link I gave). |
Thanks for the update Matt, and thanks for folks who attended. Just read
through - looks like a good discussion that is getting to the core
questions.
|
I missed the meeting! Thank you all for pushing this forward so we can all start jumping out of our backchannel ways of doing this. |
And this time the zoom link should work for folks |
Biweekly meeting was attended by @MSeal and me. Here are the agenda/minutes:
|
bummed i missed this. how do we stay up to date with events like this? |
Sorry you missed this @tonyfast. Matt had mentioned above. I don't know the best way :( |
@willingc (Sorry to hijack this issue with a question) Some PR in the |
@echarles My understanding is that those JEP proposed would need to be sent to the Steering Council for pronouncement (approval, rework, reject). |
Thx for the answer @willingc. I have more questions like |
@echarles In case you haven't seen, https://github.com/jupyter/enhancement-proposals/blob/master/jupyter-enhancement-proposal-guidelines/jupyter-enhancement-proposal-guidelines.md This is likely the best procedure doc that I have seen. Feel free to see if there is another issue open or create a new issue for further discussion. |
This is a Pre-proposal for adding a cell id field to the Jupyter Notebook Format to be included in the next minor version bump.
Why
There's a range of applications that need a mechanism for recalling particular cells across mutations of the notebooks inside and outside of a particular notebook session. Some examples include:
Traditionally users have used custom tags on cells to track particular use-cases for cell activity. This works well for things like identifying the class of content within a cell (e.g. papermill parameters cell tag) but not for activities where an application may want to dynamically associate a cell to an action or resource. Additionally, not having a cell id field has led to applications generating ids in different ways (e.g. metadata["cell_id"] = "some-string" vs metadata[application_name]["id"] = cell_guuid).
Most resource applications include ids as a standard part of the resource / sub-resources. This proposal is not touching on an overall notebook id field, but the sub-resource of cells in this instance are oftentimes treated relationally and adding an id for this field would help with improving the quality of abstractions built on top of notebooks.
Outline
This change would be wholly encompassed by adding an id field to each cell type in the 4.4 json_schema. Specifically, the raw_cell, markdown, and code_cell required sections would add the id field with the following schema:
The uuid type was recently added to json-schema referencing RFC 4122. If needed for older library implementations one can also use a str format with a regex pattern match.
This field would always be required for any future nbformat versions (4.5+). The field would not be optional, to avoid applications having to conditionally check whether an id is present. This is an important aspect of the change, as adding an optional field would lead to partial implementation in applications and difficulty in having consistent experiences with tools built on top of the id change. Older formats can be loaded by nbformat and trivially updated to 4.5 format by running uuid.uuid4() to populate the new field. The change would go into effect once the nbformat PR is submitted, merged, and released with a new schema.
Why a JEP
These two aspects defined as requiring a JEP are both met with this proposal:
One of the examples is literally this proposal, so that seems fortuitous towards formalizing a JEP 😄
Who'd be Interested
The 10 assignees + @captainsafia, @ivanov, @yuvipanda and probably several others I missed. GitHub only allows 10 assignees and this topic has come up for the past couple of years in conversation with most of the community, so I am including most of the active people I can think of.
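For reference, the upgrade path mentioned in the Outline (populating the new field with uuid.uuid4() when loading an older notebook) might look roughly like the sketch below. It works on a plain dict as loaded from JSON and does not use the nbformat library; the function name `add_cell_ids` and the top-level "id" key on each cell are assumptions based on this pre-proposal, not a released API.

```python
import uuid


def add_cell_ids(notebook):
    """Populate a missing "id" on every cell and mark the notebook as 4.5.

    `notebook` is a plain dict as loaded from an .ipynb file; cells that
    already carry an id are left untouched.
    """
    for cell in notebook.get("cells", []):
        if "id" not in cell:
            cell["id"] = str(uuid.uuid4())
    notebook["nbformat"] = 4
    notebook["nbformat_minor"] = 5
    return notebook


old = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
        {"cell_type": "code", "metadata": {}, "source": "1 + 1",
         "outputs": [], "execution_count": None},
    ],
}
upgraded = add_cell_ids(old)
new_ids = [cell["id"] for cell in upgraded["cells"]]
```

Because the ids are random, upgrading the same file twice from scratch yields different ids, which is one of the reproducibility concerns raised in the comments.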