Hash Based Cache #8

Merged
merged 24 commits into from
Feb 25, 2020

Conversation

@chrisjsewell
Member

@chrisjsewell commented Feb 23, 2020

Following on from the discussion in #6, I have made some changes to the proposed cache, to make it hash based and also remove the dependency on git.

I won't go through the whole API again; you can look at the code yourself. But here is an example of the demo CLI:

$ jcache -h
Usage: jcache [OPTIONS] COMMAND [ARGS]...

  The command line interface of jupyter-cache.

Options:
  -v, --version     Show the version and exit.
  -p, --cache-path  Print the current cache path and exit.
  -h, --help        Show this message and exit.

Commands:
  cat-artifact    Print the contents of a commit artefact.
  clear           Clear the cache completely.
  commit-limit    Change the commit limit of the cache (default: 1000).
  commit-nb       Commit a notebook that has already been executed, with...
  commit-nbs      Commit notebook(s) that have already been executed.
  diff-nb         Print a diff of a notebook to one stored in the cache.
  execute         Execute outdated notebooks.
  list-commits    List committed notebook records in the cache.
  list-staged     List notebooks staged for possible execution.
  remove-commits  Remove notebook commit(s) from the cache by Primary Key.
  show-commit     Show details of a committed notebook in the cache.
  stage-nbs       Stage notebook(s) for execution.
  unstage-nbs     Unstage notebook(s) for execution.

You can commit notebooks straight into the cache (I haven't implemented assets/artefacts just yet). When committing, a check is made that the notebooks look to have been executed correctly, i.e. the cell execution counts go sequentially up from 1.

$ jcache commit-nbs tests/notebooks/basic.ipynb 
Cache path: /Users/cjs14/GitHub/sandbox/.jupyter_cache
The cache does not yet exist, do you want to create it? [y/N]: y
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Validity Error: Expected cell 1 to have execution_count 1 not 2
The notebook may not have been executed, continue committing? [y/N]: y
Success!
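
For illustration, a minimal sketch of what this validation might look like (the function name, and whether empty code cells are skipped, are assumptions rather than the actual implementation):

import nbformat as nbf

def validate_executed(nb: nbf.NotebookNode):
    """Check that code-cell execution counts go sequentially up from 1."""
    expected = 1
    for i, cell in enumerate(nb.cells):
        if cell.cell_type != "code" or not cell.source.strip():
            continue  # assumption: empty/non-code cells are skipped
        if cell.execution_count != expected:
            raise ValueError(
                f"Expected cell {i} to have execution_count "
                f"{expected} not {cell.execution_count}"
            )
        expected += 1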

Or to skip validation:

$ jcache commit-nbs --no-validate tests/notebooks/*
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_failing.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/complex_outputs.ipynb
Success!

Once you've committed some notebooks, you can look at the 'commit records' for what has been cached:

$ jcache list-commits --hashkeys
  PK  URI                    Created           Accessed          Hashkey
----  ---------------------  ----------------  ----------------  --------------------------------
   4  complex_outputs.ipynb  2020-02-23 20:33  2020-02-23 20:33  800c4a057730a55a384cfe579e3850aa
   3  basic_unrun.ipynb      2020-02-23 20:33  2020-02-23 20:33  818f3412b998fcf4fe9ca3cca11a3fc3
   2  basic_failing.ipynb    2020-02-23 20:33  2020-02-23 20:33  72859c2bf1e12f35f30ef131f0bef320

Each notebook is hashed (code cells and kernel spec only), and the hash is used to compare against 'staged' notebooks. Multiple hashes for the same URI can be added (the URI is just there for description), and the size of the cache is limited (current default 1000) so that, once this size is reached, the least recently accessed records begin to be deleted. You can remove cached records by their Primary Key (PK).

$ jcache remove-commits 3
Removing PK = 3
Success!
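
For context, a rough sketch of how such a hash might be computed from only the kernel spec and code-cell sources (the digest algorithm and exact normalisation here are assumptions; the 32-character hashkeys above suggest MD5):

import hashlib
import nbformat as nbf

def hash_notebook(nb: nbf.NotebookNode) -> str:
    """Hash the kernel spec and code-cell sources only."""
    hasher = hashlib.md5()
    # the kernel spec determines the execution environment
    hasher.update(repr(sorted(nb.metadata.get("kernelspec", {}).items())).encode())
    for cell in nb.cells:
        if cell.cell_type == "code":
            hasher.update(cell.source.encode())
    return hasher.hexdigest()

nb = nbf.read("tests/notebooks/basic.ipynb", as_version=4)
print(hash_notebook(nb))  # a 32-character hex digest, as in the table above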

You can also diff any of the commit records with any (external) notebook:

$ jcache diff-nb 2 tests/notebooks/basic.ipynb 
nbdiff
--- committed pk=2
+++ other: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
## inserted before nb/cells/1:
+  code cell:
+    execution_count: 2
+    source:
+      a=1
+      print(a)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          1

## deleted nb/cells/1:
-  code cell:
-    source:
-      raise Exception('oopie')

If you stage some notebooks for execution, you can then list them to see which have existing records in the cache (matched by hash) and which will require execution:

$ jcache stage-nbs tests/notebooks/*
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_failing.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/complex_outputs.ipynb
Success!
$ jcache list-staged
  PK  URI                                    Created             Commit Pk
----  -------------------------------------  ----------------  -----------
   4  tests/notebooks/complex_outputs.ipynb  2020-02-23 20:48            4
   3  tests/notebooks/basic_unrun.ipynb      2020-02-23 20:48
   2  tests/notebooks/basic_failing.ipynb    2020-02-23 20:48            2
   1  tests/notebooks/basic.ipynb            2020-02-23 20:48
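
Conceptually, the Commit Pk column is filled by matching each staged notebook's hash against the commit records, roughly as follows (reusing the hypothetical hash_notebook sketch above; the real lookup goes through the database):

def match_staged(staged_uris, commit_records):
    """Map each staged URI to the PK of a commit record with the same hash."""
    pk_by_hash = {record["hashkey"]: record["pk"] for record in commit_records}
    return {
        uri: pk_by_hash.get(hash_notebook(nbf.read(uri, as_version=4)))
        for uri in staged_uris
    }  # a value of None means the notebook still requires execution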

and finally, you can run a basic execution of the required notebooks:

$ jcache execute
Executing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Success: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Executing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Success: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Finished!
$ jcache list-staged
  PK  URI                                    Created             Commit Pk
----  -------------------------------------  ----------------  -----------
   5  tests/notebooks/basic.ipynb            2020-02-23 20:57            5
   4  tests/notebooks/complex_outputs.ipynb  2020-02-23 20:48            4
   3  tests/notebooks/basic_unrun.ipynb      2020-02-23 20:48            5
   2  tests/notebooks/basic_failing.ipynb    2020-02-23 20:48            2

I think that aligns more with what you had in mind @akhmerov?

@choldgraf
Member

Thanks for this great summary. Curious if you could describe the need to move away from git. (not saying it isn't the right idea, just curious what the pros/cons are)

@chrisjsewell
Member Author

Curious if you could describe the need to move away from git.

A few things: (a) it makes the code simpler, with fewer dependencies; (b) you need to compute the special notebook hashes independently anyway (including only the code cells and kernel), which git won't do; (c) this approach allows multiple potential outputs for the same notebook to be stored (e.g. if you have different git branches), which you wouldn't really be able to do with git unless you somehow 'synced' it with the parent git repo.

@akhmerov
Contributor

@chrisjsewell definitely, looks pretty good!

Can you explain the design decision of having a separate "staged" state, as opposed to always executing immediately? It seems reasonable, but I cannot quite identify the use case, and I'd like to understand better.

What controls the execution path/context? For the book project it will become important (e.g. there's a contract that the notebooks may refer to anything in ./content, and generate outputs in the current folder).

BTW: if the project targets python ≥ 3.5, I wholeheartedly recommend pathlib.Path for dealing with files. It is to os.path what click is to argparse.

@akhmerov
Contributor

Another question: does the cache check existence exclusively based on the hash, or rather on the combination of hash + URI?

@chrisjsewell
Member Author

BTW: if the project targets python ≥ 3.5, I wholeheartedly recommend pathlib.Path for dealing with files. It is to os.path what click is to argparse.

If you look in the code, you'll see I always use pathlib.Path 😄 In fact I think I've made it ≥ 3.6 by using f-strings, but that's another matter

@chrisjsewell
Member Author

Another question: does the cache check existence exclusively based on the hash, or rather on the combination of hash + URI?

Exclusively on the hash. That was the goal, right?

@akhmerov
Contributor

If you look in the code, you'll see I always use pathlib.Path :) In fact I think I've made it ≥ 3.6 by using f-strings, but that's another matter

Ah, I see: one of the files uses os.path, and another pathlib.Path.

Exclusively on the hash. That was the goal right?

Absolutely. Just wanted to check if I understand correctly.

@akhmerov
Contributor

Compared to the minimal implementation that I outlined in #7 this has:

  • Created/accessed timestamps, useful for pruning
  • URI, also handy if we want to search for all versions of one notebook
  • Staged state
  • Diffs, also seems handy
  • sqlite/sqlalchemy as an extra dependency

Overall I feel that it strikes a good balance between complexity and feature set. The extra dependencies are commonly used enough not to be a big nuisance. Plus, one can provide a drop-in replacement if need be.

@akhmerov
Contributor

Another question: the cache is not the source of truth wrt contents of markdown cells?

For example, if I query it for a notebook that has the same code cells as one stored in cache, but different markdown cells, then the cache is supposed to be unreliable wrt markdown cells. Is that correct?

@chrisjsewell
Member Author

Absolutely. Just wanted to check if I understand correctly.

Yeh the URIs are there purely for introspection of the cache.

Can you explain the design decision of having a separate "staged" state

The main use case, I guess, is manual control of the execution flow. It can be good to see what will get executed before you execute it; so you add a number of notebooks as staged, then you can inspect which ones will actually be re-executed before running the execution. I could even envisage getting fancy and staging notebooks with different execution parameters before firing them all off to the executor.

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

Another question: the cache is not the source of truth wrt contents of markdown cells?

Yes, that's the current design; cell outputs, and hence code execution, should only depend on changes to the source code or execution-specific metadata (i.e. the kernelspec), so markdown is ignored. Does that make sense?

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

What controls the execution path/context? For the book project it will become important (e.g. there's a contract that the notebooks may refer to anything in ./content, and generate outputs in the current folder.

Nothing really yet; there is just a basic 'executor' implemented that isolates a notebook in a tempfolder and sets the cwd to that tempfolder. This is just a placeholder, though, pending more careful consideration of these questions.
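
As a sketch, such a placeholder executor could look something like this (using nbconvert's ExecutePreprocessor; the actual implementation may differ):

import tempfile
import nbformat as nbf
from nbconvert.preprocessors import ExecutePreprocessor

def execute_isolated(nb: nbf.NotebookNode, timeout: int = 120) -> nbf.NotebookNode:
    """Execute a notebook with the cwd set to a fresh temporary folder."""
    with tempfile.TemporaryDirectory() as tmpdir:
        ep = ExecutePreprocessor(timeout=timeout)
        # 'path' controls the directory the kernel is started in
        ep.preprocess(nb, {"metadata": {"path": tmpdir}})
    return nb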

@akhmerov
Contributor

markdown is ignored. Does that make sense?

Yes, makes perfect sense. Perhaps it may even be cleaner to strip the cached notebooks of all markdown, so that there's no risk of using it.


More thoughts:

  1. What should control whether the execution tolerates errors? Would this be an API option (--errors-ok), a notebook metadata field, a cell metadata field, or something else?
  2. This cache implementation might be useful to integrate with jupyter contents managers, so that the users see the notebook loaded immediately together with its outputs if it's cached.

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

Perhaps it may even be cleaner to strip the cached notebooks of all markdown, so that there's no risk of using it.

Agreed [edit: this has been implemented in 6c0083d]

@chrisjsewell
Member Author

What should control whether the execution tolerates errors? Would this be an API option (--errors-ok), a notebook metadata field, a cell metadata field, or something else?

I guess this should be in the notebook, so that it can be included in the hash.

@choldgraf
Member

@mmcky will want to take a glance at this before the meeting on Monday imo 😁

These are additional files output by the notebook during execution.
These files must be in the same folder as the notebook, or a subfolder of it.
@akhmerov
Contributor

As in there is no assumption that data in the cache has a physical path on disk. That was the idea behind passing out io.BufferedReaders, rather than just paths to the artefacts.

Hold on, you're saying that not only are the files opened, but there's also no way to infer their path in case they actually are available on disk? That would likely result in performance degradation (read+write instead of copy).

@chrisjsewell
Member Author

right now sqlalchemy does not support asyncio. At the same time, sqlite doesn't handle multiple writers very well.

Well, I've certainly used sqlalchemy/sqlite for multi-threading (using the older concurrent.futures approach) and haven't seen any issues. Also, a project I know well, AiiDA, uses sqlalchemy/postgres for a multi-processing application (the great thing about sqlalchemy/django is that it's super easy to switch to a different engine if necessary). At any rate, it's something to consider, but they are certainly better than trying to work with JSON files!

But more to the point, the cache may not necessarily be involved in the actual parallel execution phase. It may be that it just provides everything to the 'executor', which does its thing, then provides back all the completed notebooks + artefacts at the end.

@akhmerov
Contributor

But more to the point, the cache may not necessarily be involved in the actual parallel execution phase.

Ah, great point. Even if the cache is involved, there could be a blocking execute_all operation, which internally uses an asyncio loop to parallelize the execution and doesn't write into the database.

@akhmerov
Contributor

Still another question about the API (since this PR introduces essentially all of it):

Should there be a way to query the outputs of a single cell? This will typically require reading the complete notebook from disk, reading one cell, and discarding the rest. Of course the users may also ask for the complete notebook from the start, but providing a likely inefficient API endpoint increases the maintenance burden and promotes bad usage patterns.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

Hold on, you're saying that not only are the files opened, but there's also no way to infer their path in case they actually are available on disk? That would likely result in performance degradation (read+write instead of copy).

Well, you can 'abuse' the API and get the physical path, and it would take very little code to change the API to be 'physical path' based. But, as I mentioned, then you're blocking off any possibility for your cache to be non-local. It just depends on what magnitude this degradation will be in relation to the rest of the process. Any pointers to actual timings of the difference of rb->wb over shutil.copyfile? At the end of the day, I imagine shutil.copyfile still uses some form of read/write, just closer to the metal.
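
In lieu of hard numbers, a quick way to measure the difference locally (a sketch only; 'artifact.bin' is a placeholder file, and results will vary by platform and file system):

import shutil
import timeit

def copy_via_streams(src, dst, chunk=1024 * 1024):
    # read+write through Python-level file objects, as a BufferedReader API implies
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk)

print(timeit.timeit(lambda: copy_via_streams("artifact.bin", "out1.bin"), number=10))
print(timeit.timeit(lambda: shutil.copyfile("artifact.bin", "out2.bin"), number=10))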

@chrisjsewell
Member Author

Should there be a way to query outputs of a single cell?

As in, this is what you want? Because that's already in the API:

    @abstractmethod
    def get_commit_codecell(self, pk: int, index: int) -> nbf.NotebookNode:
        """Return a code cell from a committed notebook.

        NOTE: the index **only** refers to the list of code cells, e.g.
        `[codecell_0, textcell_1, codecell_2]`
        would map {0: codecell_0, 1: codecell_2}
        """
        pass

@akhmerov
Contributor

At the end of the day shutil.copyfile I imagine still uses some form of readwrite, just closer to the metal.

That strongly depends on the file system. Several modern ones (BTRFS, ZFS) do copy on write, in which case there would be no actual copying involved.

But as I mentioned, then you're blocking off any possibility for your cache to be non-local.

I realize that (as an aside, there would also be workarounds for that). However, my main point is that there is a trade-off involved, also for the cache consumer (if they only want to copy the file, they now also need to read it and write it back).

@akhmerov
Contributor

Should there be a way to query a single code cell? ...

As in this is what you want? Because thats already in the API:

I am rather wondering if it could be better to remove it for now, until there is a definite use case.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

However, my main point is that there is a trade-off involved

Oh yeh, absolutely. Perhaps a middle ground would be path access with a context manager; as in "this filepath is only guaranteed to exist within this with statement".

with cache.get_artifacts_path(pk=1) as path:
    shutil.copytree(path, "wherever")
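
A minimal sketch of how such a context manager might be implemented for a purely local cache (the folder layout is an assumption):

import contextlib
import pathlib

@contextlib.contextmanager
def get_artifacts_path(cache_dir: str, pk: int):
    """Yield the folder holding a commit record's artefacts.

    For a non-local cache this could instead download the artefacts to a
    temporary folder and delete it on exit; for local storage there is
    nothing to clean up.
    """
    yield pathlib.Path(cache_dir) / "artifacts" / str(pk)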

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

I am rather wondering if it could be better to remove it for now

Yeh, that's not a deal breaker. It mainly spawned from the earlier jupyter-sphinx proposal I made, whereby you were gathering all the code from the docutils.nodes and leaving pointers to the cells, before executing all the code, then adding the outputs to the nodes in a post-transform.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

the users may also ask for the complete notebook from the start

The one thing that should be made clear when receiving a notebook from the cache is that the cell indices may not be the same as in the input notebook. The number and order of code cells will be the same, but there may be differing numbers of markdown cells, so that potentially nb_in.cells[3] != nb_cache.cells[3].

At the moment I've replaced all the markdown cells for cached notebooks with cell.source = "" and cell.metadata = {}. Why I didn't completely remove them was so that diffing would show the same cell indices as the source code, but I realise this can't be strictly guaranteed, so I think I'll just remove them completely.

@chrisjsewell
Member Author

The one thing that should be made clear when receiving a notebook from the cache is that the cell indices may not be the same as in the input notebook.

Actually, it would probably be good to supply a merge function that injects the updated outputs into an input notebook. Then you could use that as the 'source of truth' for everything and make it available for download etc.
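
A sketch of what such a merge function could look like (assuming, as discussed above, that the cached notebook retains only the code cells, in the same order as the input notebook's code cells):

import nbformat as nbf

def merge_outputs(nb_in: nbf.NotebookNode, nb_cache: nbf.NotebookNode) -> nbf.NotebookNode:
    """Inject outputs from a cached notebook into the matching input notebook."""
    cached_code = [c for c in nb_cache.cells if c.cell_type == "code"]
    input_code = [c for c in nb_in.cells if c.cell_type == "code"]
    assert len(cached_code) == len(input_code), "code cells must correspond"
    for target, source in zip(input_code, cached_code):
        target.outputs = source.outputs
        target.execution_count = source.execution_count
    return nb_in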

@akhmerov
Contributor

More API questions: right now staging/committing requires a URI, which means that in-memory notebooks cannot be served to the cache. Should it be file_or_stream instead? Or even a NotebookNode object?

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

More API questions: right now staging/committing requires a URI

No, there's commit_notebook_bundle, which takes exactly that: a NotebookNode, wrapped in an NbBundleIn that also allows you to specify a URI and artifacts. commit_notebook_file just reads in the notebook file and then calls commit_notebook_bundle.

For staging, at present I am not actually copying anything into the cache, just keeping a reference to the URIs in the database. I wasn't sure if it was worth the extra read/write/copy to do this, or better to just give the executor the URIs of the notebooks to fetch (and possibly the URIs of assets).
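
For illustration, usage might look something like this (the import path and the exact NbBundleIn fields are assumptions based on the description above):

import nbformat as nbf
from jupyter_cache.base import NbBundleIn  # hypothetical import path

# 'cache' would be a concrete cache instance, e.g. the sqlite-backed one
nb = nbf.read("tests/notebooks/basic.ipynb", as_version=4)
bundle = NbBundleIn(nb=nb, uri="tests/notebooks/basic.ipynb", artifacts=())
cache.commit_notebook_bundle(bundle)  # commit an in-memory notebook directly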

- Commented out get_commit_codecell until use case established
- Added `commit_artefacts_temppath`
- Fully remove non-code cells from commited notebooks
@chrisjsewell
Member Author

@akhmerov see 763e38f to address some of your points

@akhmerov
Contributor

Looks good, thanks for bearing with me :)

@chrisjsewell
Member Author

Looks good, thanks for bearing with me :)

Oh no, it's great feedback 👍

@chrisjsewell
Member Author

The only main missing component now is execution assets.

I might just merge this soon and put that in an issue for later.

@chrisjsewell
Member Author

Merging now, but feel free to continue the discussion here or open separate issues.

@chrisjsewell chrisjsewell marked this pull request as ready for review February 25, 2020 06:05
@chrisjsewell chrisjsewell merged commit 1b19c75 into master Feb 25, 2020
@bsipocz bsipocz deleted the hash branch September 13, 2024 16:49