Hash Based Cache #8
Conversation
- Wrap commit exceptions in `CachingError`
- Add `NbBundle` for passing notebooks (and associated data) to/from the cache
- Add `to_str` option to `diff_staged_notebook`
- Add `iter_notebooks_to_exec`
Thanks for this great summary. Curious if you could describe the need to move away from git (not saying it isn't the right idea, just curious what the pros/cons are).
A few things:

(a) it makes the code simpler, with fewer dependencies;
(b) you need to compute the special notebook hashes independently anyway (including only the code cells and kernel), which git won't do;
(c) this approach allows multiple potential outputs for the same notebook to be stored (e.g. if you have different git branches), which you wouldn't really be able to do with git, unless you somehow 'synced' it with the parent git repo.
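To illustrate (b), here is a minimal sketch of the kind of hash computation this implies, assuming `nbformat`-style notebooks; the function name and details are hypothetical, not the actual implementation:

```python
import hashlib

import nbformat as nbf

def hash_notebook(nb: nbf.NotebookNode) -> str:
    """Hash only the code-cell sources and the kernel spec, so that
    markdown edits do not invalidate cached outputs (hypothetical sketch)."""
    hasher = hashlib.sha256()
    # The kernel spec determines the execution environment.
    kernelspec = nb.metadata.get("kernelspec", {})
    hasher.update(repr(sorted(kernelspec.items())).encode())
    for cell in nb.cells:
        if cell.cell_type == "code":
            hasher.update(cell.source.encode())
    return hasher.hexdigest()
```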
@chrisjsewell definitely, looks pretty good! Can you explain the design decision of having a separate "staged" state, as opposed to always executing immediately? It seems reasonable, but I cannot quite identify the use case, and I'd like to understand it better.

What controls the execution path/context? For the book project this will become important (e.g. there's a contract that the notebooks may refer to anything in the book directory).

BTW: if the project targets Python ≥ 3.5, I wholeheartedly recommend `pathlib`.
Another question: does the cache check existence exclusively based on the hash, or rather on the combination of hash + URI?
If you look in the code, you'll see I always use `pathlib.Path`.
Exclusively on the hash. That was the goal, right?
Ah, I see: one of the files uses `os.path`, and another `pathlib.Path`.
Absolutely. Just wanted to check if I understand correctly.
Compared to the minimal implementation that I outlined in #7, this has extra dependencies and a larger feature set. Overall I feel that it strikes a good balance between complexity and feature set. The extra dependencies are sufficiently commonly used not to be a big nuisance. Plus, one can provide a drop-in replacement if need be.
Another question: the cache is not the source of truth w.r.t. the contents of markdown cells? For example, if I query it for a notebook that has the same code cells as one stored in the cache, but different markdown cells, then the cache is supposed to be unreliable w.r.t. markdown cells. Is that correct?
Yeh the URIs are there purely for introspection of the cache.
The main use case, I guess, is manual control of the execution flow. It can be good to see what will get executed before you execute it; so you add a number of notebooks as staged, then you can inspect which ones will actually be re-executed before running the execution. I could even envisage getting fancy and staging notebooks with different execution parameters before firing them all off to the executor.
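A rough sketch of that flow; the staging/listing method names here are hypothetical, loosely based on `iter_notebooks_to_exec` from the commit summary above:

```python
# Hypothetical usage sketch of the staged workflow described above.
cache.stage_notebook_file("notebooks/analysis.ipynb")  # hypothetical name
cache.stage_notebook_file("notebooks/report.ipynb")

# Inspect which staged notebooks lack a matching hash in the cache,
# i.e. which ones would actually be (re-)executed:
for bundle in cache.iter_notebooks_to_exec():
    print("needs execution:", bundle.uri)

# Only then hand them off to the executor.
```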
Yes that's the current design; cell outputs, and hence code execution, should only be reliant on changes to the source code or execution-specific metadata (i.e. the kernel spec).
Nothing really yet; there is just a basic 'executor' implemented that isolates a notebook in a tempfolder and sets the working directory there.
Yes, makes perfect sense. Perhaps it may even be cleaner to strip the cached notebooks of all markdown, so that there's no risk of using it. More thoughts:
Agreed [edit: this has been implemented in 6c0083d]
I guess this should be in the notebook, so that it can be included in the hash.
@mmcky will want to take a glance at this before the meeting on Monday imo 😁
These are additional files output by the notebook during execution. These files must be in the same folder as the notebook, or a subfolder of it.
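A sketch of how that same-folder-or-subfolder constraint could be checked, using `pathlib` (the function is illustrative, not the actual implementation):

```python
from pathlib import Path

def validate_artifact(notebook: Path, artifact: Path) -> Path:
    """Ensure the artifact lives in the notebook's folder (or a subfolder),
    returning its path relative to that folder (illustrative sketch)."""
    try:
        return artifact.resolve().relative_to(notebook.resolve().parent)
    except ValueError:
        raise ValueError(
            f"{artifact} is not in the same folder as {notebook} (or a subfolder)"
        )
```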
Hold on, you're saying that not only are the files opened, but also there's no way to infer their path in case they actually are available on disk? That would likely result in performance degradation (read+write instead of copy).
Well I've certainly used sqlalchemy/sqlite for multi-threading. But more to the point: the cache may not necessarily be involved in the actual parallel execution phase. It may be that it just provides everything to the 'executor', which does its thing, then provides back all the completed notebooks + artefacts at the end.
Ah, great point. Even if the cache is involved, there could be a blocking `commit` step.
Still another question about the API (since this PR introduces essentially all of it): should there be a way to query the outputs of a single cell? This will typically require reading the complete notebook from disk, keeping one cell, and discarding the rest. Of course the users may also ask for the complete notebook from the start, but providing a likely inefficient API endpoint increases the maintenance burden and promotes bad usage patterns.
Well you can 'abuse' the API and get the physical path, and it would take very little code to change the API to be 'physical path' based. But as I mentioned, then you're blocking off any possibility for your cache to be non-local. It just depends on what magnitude this degradation will be in relation to the rest of the process. Any pointers to actual timings of the difference?
As in, is this what you want? Because that's already in the API:

```python
@abstractmethod
def get_commit_codecell(self, pk: int, index: int) -> nbf.NotebookNode:
    """Return a code cell from a committed notebook.

    NOTE: the index **only** refers to the list of code cells, e.g.
    `[codecell_0, textcell_1, codecell_2]`
    would map {0: codecell_0, 1: codecell_2}
    """
    pass
```
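So, for a committed notebook whose cells are `[codecell_0, textcell_1, codecell_2]`, usage would look like this (hypothetical call on a cache instance):

```python
first = cache.get_commit_codecell(pk=1, index=0)   # returns codecell_0
second = cache.get_commit_codecell(pk=1, index=1)  # returns codecell_2 (markdown skipped)
```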
That strongly depends on the file system. Several modern ones (BTRFS, ZFS) do copy on write, in which case there would be no actual copying involved.
I realize that (as an aside, there would also be workarounds for that). However, my main point is that there is a trade-off involved, also for the cache consumer: if they only want to copy the file, they now also need to read it and write it back.
I am rather wondering whether it would be better to remove it for now, until there is a definite use case.
Oh yeh absolutely. Perhaps a middle ground would be path access with a context manager; as in "this filepath is only guaranteed to exist within this `with` block":

```python
with cache.get_artifacts_path(pk=1) as path:
    shutil.copytree(path, "wherever")
```
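A minimal sketch of how such a context manager could work, assuming a simple on-disk layout (all names and the layout here are hypothetical):

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def artifacts_path(cache_folder: Path, pk: int):
    """Yield a folder of artifacts that is only guaranteed to exist
    within the ``with`` block (hypothetical sketch)."""
    local = cache_folder / "artifacts" / str(pk)  # hypothetical layout
    if local.is_dir():
        # Local cache: yield the stored folder directly.
        yield local
    else:
        # Non-local cache: fetch into a temporary folder, then clean up.
        tmp = Path(tempfile.mkdtemp())
        try:
            # fetch_remote_artifacts(pk, tmp)  # hypothetical fetch step
            yield tmp
        finally:
            shutil.rmtree(tmp)
```

This keeps the API path-based for the local case, without ruling out a non-local cache later.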
Yeh, that's not a deal breaker. It mainly spawned from the earlier discussion.
The one thing that should be made clear when receiving a notebook from the cache is that the cell indices may not be the same as in the input notebook. The number and order of code cells will be the same, but there may be differing numbers of markdown cells, so the indices can potentially shift. At the moment I've replaced all the markdown cells for cached notebooks with a placeholder.
Actually, it would probably be good to supply a merge function that injects the updated outputs into an input notebook. Then you could use it as the 'source of truth' for everything and make it available for download etc.
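Something like this sketch, say; it relies on the guarantee above that the number and order of code cells match (the function name is hypothetical):

```python
import nbformat as nbf

def merge_outputs(input_nb: nbf.NotebookNode,
                  cached_nb: nbf.NotebookNode) -> nbf.NotebookNode:
    """Inject outputs/execution counts from the cached notebook's code cells
    into the input notebook, which keeps its own markdown (illustrative)."""
    cached_code = [c for c in cached_nb.cells if c.cell_type == "code"]
    input_code = [c for c in input_nb.cells if c.cell_type == "code"]
    if len(cached_code) != len(input_code):
        raise ValueError("notebooks do not have matching code cells")
    for src, dest in zip(cached_code, input_code):
        dest.outputs = src.get("outputs", [])
        dest.execution_count = src.get("execution_count")
    return input_nb
```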
More API questions: right now staging/committing requires a URI, which means that in-memory notebooks may not be served to the cache. Should it be possible to pass them in directly?
No, there's the `NbBundle` for passing notebooks (and associated data) in directly. For the staging, at present I am not actually copying anything into the cache, just keeping a reference to the URIs in the database. I wasn't sure if it was worth the extra read/write/copy to do this, or to just give the executor the URIs of the notebooks to fetch (and possibly URIs of assets).
- Commented out `get_commit_codecell` until a use case is established
- Added `commit_artefacts_temppath`
- Fully remove non-code cells from committed notebooks
Looks good, thanks for bearing with me :)
Oh no, it's great feedback 👍
The only main missing component now is execution assets. I might just merge this soon and put that in an issue for later.
Merging now, but feel free to continue the discussion here, or open separate issues.
Following from the discussion in #6, I have made some changes to the proposed cache, to make it hash-based and also remove the dependency on `git`. I won't go through the whole API again; you can look at the code yourself. But here is a walkthrough of the demo CLI.
You can commit notebooks straight into the cache (I haven't implemented assets/artefacts just yet). When committing, a check will be made that the notebooks appear to have been executed correctly, i.e. that the cell execution counts go sequentially up from 1.
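That validation check amounts to something like this sketch (not the actual implementation):

```python
import nbformat as nbf

def looks_executed(nb: nbf.NotebookNode) -> bool:
    """Check the code cells appear to have been executed in order,
    i.e. execution counts run 1, 2, 3, ... (illustrative sketch)."""
    counts = [c.execution_count for c in nb.cells if c.cell_type == "code"]
    return counts == list(range(1, len(counts) + 1))
```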
This validation can also be skipped.
Once you've committed some notebooks, you can look at the 'commit records' for what has been cached.
Each notebook is hashed (code cells and kernel spec only), and the hash is used to compare against 'staged' notebooks. Multiple hashes for the same URI can be added (the URI is just there for description), and the size of the cache is limited (current default 1000) so that, at this size, the least-recently-accessed records begin to be deleted. You can remove cached records by their Primary Key (PK).
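A sketch of that size-limit behaviour (illustrative only; the record attribute is hypothetical):

```python
def evict_excess(records, limit=1000):
    """Return the records to delete: everything beyond the `limit`
    most-recently-accessed commit records (illustrative sketch)."""
    by_recency = sorted(records, key=lambda r: r.accessed, reverse=True)
    return by_recency[limit:]
```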
You can also diff any of the commit records with any (external) notebook.
If I stage some notebooks for execution, you can then list them to see which have existing records in the cache (by hash) and which will require execution.
And finally, you can run a basic execution of the required notebooks.
I think that aligns more with what you had in mind @akhmerov?