Hash Based Cache #8

Merged
merged 24 commits into from
Feb 25, 2020

Conversation

@chrisjsewell
Member

@chrisjsewell commented Feb 23, 2020

Following on from the discussion in #6, I have made some changes to the proposed cache, to make it hash based and also remove the dependency on git.

I won't go through the whole API again; you can look at the code yourself. But here is an example of the demo CLI:

$ jcache -h
Usage: jcache [OPTIONS] COMMAND [ARGS]...

  The command line interface of jupyter-cache.

Options:
  -v, --version     Show the version and exit.
  -p, --cache-path  Print the current cache path and exit.
  -h, --help        Show this message and exit.

Commands:
  cat-artifact    Print the contents of a commit artefact.
  clear           Clear the cache completely.
  commit-limit    Change the commit limit of the cache (default: 1000).
  commit-nb       Commit a notebook that has already been executed, with...
  commit-nbs      Commit notebook(s) that have already been executed.
  diff-nb         Print a diff of a notebook to one stored in the cache.
  execute         Execute outdated notebooks.
  list-commits    List committed notebook records in the cache.
  list-staged     List notebooks staged for possible execution.
  remove-commits  Remove notebook commit(s) from the cache by Primary Key.
  show-commit     Show details of a committed notebook in the cache.
  stage-nbs       Stage notebook(s) for execution.
  unstage-nbs     Unstage notebook(s) for execution.

You can commit notebooks straight into the cache (I haven't implemented assets/artefacts just yet). When committing, a check is made that the notebooks look to have been executed correctly, i.e. the cell execution counts go sequentially up from 1.

$ jcache commit-nbs tests/notebooks/basic.ipynb 
Cache path: /Users/cjs14/GitHub/sandbox/.jupyter_cache
The cache does not yet exist, do you want to create it? [y/N]: y
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Validity Error: Expected cell 1 to have execution_count 1 not 2
The notebook may not have been executed, continue committing? [y/N]: y
Success!
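
For illustration, a minimal sketch of what this validation might look like (the function name, and whether empty code cells are skipped, are assumptions rather than the actual implementation):

import nbformat as nbf

def validate_executed(nb: nbf.NotebookNode):
    """Check that code-cell execution counts go sequentially up from 1."""
    expected = 1
    for i, cell in enumerate(nb.cells):
        if cell.cell_type != "code" or not cell.source.strip():
            continue  # assumption: empty/non-code cells are skipped
        if cell.execution_count != expected:
            raise ValueError(
                f"Expected cell {i} to have execution_count "
                f"{expected} not {cell.execution_count}"
            )
        expected += 1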

Or to skip validation:

$ jcache commit-nbs --no-validate tests/notebooks/*
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_failing.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Committing: /Users/cjs14/GitHub/sandbox/tests/notebooks/complex_outputs.ipynb
Success!

Once you've committed some notebooks, you can look at the 'commit records' for what has been cached:

$ jcache list-commits --hashkeys
  PK  URI                    Created           Accessed          Hashkey
----  ---------------------  ----------------  ----------------  --------------------------------
   4  complex_outputs.ipynb  2020-02-23 20:33  2020-02-23 20:33  800c4a057730a55a384cfe579e3850aa
   3  basic_unrun.ipynb      2020-02-23 20:33  2020-02-23 20:33  818f3412b998fcf4fe9ca3cca11a3fc3
   2  basic_failing.ipynb    2020-02-23 20:33  2020-02-23 20:33  72859c2bf1e12f35f30ef131f0bef320

Each notebook is hashed (code cells and kernel spec only), and the hash is used to compare against 'staged' notebooks. Multiple hashes for the same URI can be added (the URI is just there for description), and the size of the cache is limited (current default 1000) so that, once this size is reached, the least recently accessed records begin to be deleted. You can remove cached records by their Primary Key (PK).

$ jcache remove-commits 3
Removing PK = 3
Success!
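
For context, a rough sketch of how such a hash might be computed from only the kernel spec and code-cell sources (the digest algorithm and exact normalisation here are assumptions; the 32-character hashkeys above suggest MD5):

import hashlib
import nbformat as nbf

def hash_notebook(nb: nbf.NotebookNode) -> str:
    """Hash the kernel spec and code-cell sources only."""
    hasher = hashlib.md5()
    # the kernel spec determines the execution environment
    hasher.update(repr(sorted(nb.metadata.get("kernelspec", {}).items())).encode())
    for cell in nb.cells:
        if cell.cell_type == "code":
            hasher.update(cell.source.encode())
    return hasher.hexdigest()

nb = nbf.read("tests/notebooks/basic.ipynb", as_version=4)
print(hash_notebook(nb))  # a 32-character hex digest, as in the table above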

You can also diff any of the commit records with any (external) notebook:

$ jcache diff-nb 2 tests/notebooks/basic.ipynb 
nbdiff
--- committed pk=2
+++ other: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
## inserted before nb/cells/1:
+  code cell:
+    execution_count: 2
+    source:
+      a=1
+      print(a)
+    outputs:
+      output 0:
+        output_type: stream
+        name: stdout
+        text:
+          1

## deleted nb/cells/1:
-  code cell:
-    source:
-      raise Exception('oopie')

If you stage some notebooks for execution, you can then list them to see which have existing records in the cache (matched by hash) and which will require execution:

$ jcache stage-nbs tests/notebooks/*
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_failing.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Staging: /Users/cjs14/GitHub/sandbox/tests/notebooks/complex_outputs.ipynb
Success!
$ jcache list-staged
  PK  URI                                    Created             Commit Pk
----  -------------------------------------  ----------------  -----------
   4  tests/notebooks/complex_outputs.ipynb  2020-02-23 20:48            4
   3  tests/notebooks/basic_unrun.ipynb      2020-02-23 20:48
   2  tests/notebooks/basic_failing.ipynb    2020-02-23 20:48            2
   1  tests/notebooks/basic.ipynb            2020-02-23 20:48
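
Conceptually, the Commit Pk column is filled by matching each staged notebook's hash against the commit records, roughly as follows (reusing the hypothetical hash_notebook sketch above; the real lookup goes through the database):

def match_staged(staged_uris, commit_records):
    """Map each staged URI to the PK of a commit record with the same hash."""
    pk_by_hash = {record["hashkey"]: record["pk"] for record in commit_records}
    return {
        uri: pk_by_hash.get(hash_notebook(nbf.read(uri, as_version=4)))
        for uri in staged_uris
    }  # a value of None means the notebook still requires execution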

and finally, you can run a basic execution of the required notebooks:

$ jcache execute
Executing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Success: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic.ipynb
Executing: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Success: /Users/cjs14/GitHub/sandbox/tests/notebooks/basic_unrun.ipynb
Finished!
$ jcache list-staged
  PK  URI                                    Created             Commit Pk
----  -------------------------------------  ----------------  -----------
   5  tests/notebooks/basic.ipynb            2020-02-23 20:57            5
   4  tests/notebooks/complex_outputs.ipynb  2020-02-23 20:48            4
   3  tests/notebooks/basic_unrun.ipynb      2020-02-23 20:48            5
   2  tests/notebooks/basic_failing.ipynb    2020-02-23 20:48            2

I think that aligns more with what you had in mind @akhmerov?

@choldgraf
Member

Thanks for this great summary. Curious if you could describe the need to move away from git. (not saying it isn't the right idea, just curious what the pros/cons are)

@chrisjsewell
Member Author

Curious if you could describe the need to move away from git.

A few things: (a) it makes the code simpler, with fewer dependencies; (b) you need to compute the special notebook hashes independently anyway (including only the code cells and kernel), which git won't do; (c) this approach allows multiple potential outputs for the same notebook to be stored (e.g. if you have different git branches), which you wouldn't really be able to do with git unless you somehow 'synced' it with the parent git repo.

@akhmerov
Contributor

@chrisjsewell definitely, looks pretty good!

Can you explain the design decision of having a separate "staged" state, as opposed to always executing immediately? It seems reasonable, but I cannot quite identify the use case, and I'd like to understand better.

What controls the execution path/context? For the book project it will become important (e.g. there's a contract that the notebooks may refer to anything in ./content, and generate outputs in the current folder).

BTW: if the project targets python ≥ 3.5, I wholeheartedly recommend pathlib.Path for dealing with files. It is to os.path what click is to argparse.

@akhmerov
Contributor

Another question: does the cache check existence exclusively based on the hash, or rather on the combination of hash + URI?

@chrisjsewell
Member Author

BTW: if the project targets python ≥ 3.5, I wholeheartedly recommend pathlib.Path for dealing with files. It is to os.path what click is to argparse.

If you look in the code, you'll see I always use pathlib.Path 😄 In fact I think I've made it ≥ 3.6 by using f-strings, but that's another matter

@chrisjsewell
Member Author

Another question: does the cache check existence exclusively based on the hash, or rather on the combination of hash + URI?

Exclusively on the hash. That was the goal, right?

@akhmerov
Contributor

If you look in the code, you'll see I always use pathlib.Path :) In fact I think I've made it ≥ 3.6 by using f-strings, but that's another matter

Ah, I see: one of the files uses os.path, and another pathlib.Path.

Exclusively on the hash. That was the goal right?

Absolutely. Just wanted to check if I understand correctly.

@akhmerov
Contributor

Compared to the minimal implementation that I outlined in #7 this has:

  • Created/accessed timestamps, useful for pruning
  • URI, also handy if we want to search for all versions of one notebook
  • Staged state
  • Diffs, also seems handy
  • sqlite/sqlalchemy as an extra dependency

Overall I feel that it strikes a good balance between complexity and feature set. The extra dependencies are commonly used enough not to be a big nuisance. Plus, one can provide a drop-in replacement if need be.

@akhmerov
Contributor

Another question: the cache is not the source of truth wrt contents of markdown cells?

For example, if I query it for a notebook that has the same code cells as one stored in cache, but different markdown cells, then the cache is supposed to be unreliable wrt markdown cells. Is that correct?

@chrisjsewell
Member Author

Absolutely. Just wanted to check if I understand correctly.

Yeh the URIs are there purely for introspection of the cache.

Can you explain the design decision of having a separate "staged" state

The main use case, I guess, is manual control of the execution flow. It can be good to see what will get executed before you execute it; so you add a number of notebooks as staged, then you can inspect which ones will actually be re-executed before running the execution. I could even envisage getting fancy and staging notebooks with different execution parameters before firing them all off to the executor.

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

Another question: the cache is not the source of truth wrt contents of markdown cells?

Yes, that's the current design; cell outputs, and hence code execution, should only depend on changes to the source code or execution-specific metadata (i.e. the kernelspec), so markdown is ignored. Does that make sense?

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

What controls the execution path/context? For the book project it will become important (e.g. there's a contract that the notebooks may refer to anything in ./content, and generate outputs in the current folder.

Nothing really yet; there is just a basic 'executor' implemented that isolates a notebook in a tempfolder and sets the cwd to that tempfolder. This is just a placeholder, though, pending more careful consideration of these questions.
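
As a sketch, such a placeholder executor could look something like this (using nbconvert's ExecutePreprocessor; the actual implementation may differ):

import tempfile
import nbformat as nbf
from nbconvert.preprocessors import ExecutePreprocessor

def execute_isolated(nb: nbf.NotebookNode, timeout: int = 120) -> nbf.NotebookNode:
    """Execute a notebook with the cwd set to a fresh temporary folder."""
    with tempfile.TemporaryDirectory() as tmpdir:
        ep = ExecutePreprocessor(timeout=timeout)
        # 'path' controls the directory the kernel is started in
        ep.preprocess(nb, {"metadata": {"path": tmpdir}})
    return nb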

@akhmerov
Contributor

markdown is ignored. Does that make sense?

Yes, makes perfect sense. Perhaps it may even be cleaner to strip the cached notebooks of all markdown, so that there's no risk of using it.


More thoughts:

  1. What should control whether the execution tolerates errors? Would this be an API option (--errors-ok), a notebook metadata field, a cell metadata field, or something else?
  2. This cache implementation might be useful to integrate with jupyter contents managers, so that the users see the notebook loaded immediately together with its outputs if it's cached.

@chrisjsewell
Member Author

chrisjsewell commented Feb 23, 2020

Perhaps it may even be cleaner to strip the cached notebooks of all markdown, so that there's no risk of using it.

Agreed [edit: this has been implemented in 6c0083d]

@chrisjsewell
Member Author

What should control whether the execution tolerates errors? Would this be an API option (--errors-ok), a notebook metadata field, a cell metadata field, or something else?

I guess this should be in the notebook, so that it can be included in the hash.

@choldgraf
Member

@mmcky will want to take a glance at this before the meeting on Monday imo 😁

These are additional files output by the notebook during execution.
These files must be in the same folder as the notebook, or a subfolder of it.
@akhmerov
Contributor

As in there is no assumption that data in the cache has a physical path on disk. That was the idea behind passing out io.BufferedReaders, rather than just paths to the artefacts.

Hold on, you're saying that not only are the files opened, but there's also no way to infer their path in case they actually are available on disk? That would likely result in performance degradation (read+write instead of copy).

@chrisjsewell
Member Author

right now sqlalchemy does not support asyncio. At the same time, sqlite doesn't handle multiple writers very well.

Well, I've certainly used sqlalchemy/sqlite for multi-threading (using the older concurrent.futures approach) and haven't seen any issues. Also, a project I know well, AiiDA, uses sqlalchemy/postgres for a multi-processing application (the great thing about sqlalchemy/django is that it's super easy to switch to a different engine if necessary). At any rate, it's something to consider, but they are certainly better than trying to work with JSON files!

But more to the point, the cache may not necessarily be involved in the actual parallel execution phase. It may be that it just provides everything to the 'executor', which does its thing, then provides back all the completed notebooks + artefacts at the end.

@akhmerov
Contributor

But more to the point, the cache may not necessarily be involved in the actual parallel execution phase.

Ah, great point. Even if the cache is involved, there could be a blocking execute_all operation, which internally uses an asyncio loop to parallelize the execution and doesn't write into the database.

@akhmerov
Contributor

Still another question about the API (since this PR introduces essentially all of it):

Should there be a way to query the outputs of a single cell? This will typically require reading the complete notebook from disk, reading one cell, and discarding the rest. Of course the users may also ask for the complete notebook from the start, but providing a likely inefficient API endpoint increases the maintenance burden and promotes bad usage patterns.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

Hold on, you're saying that not only are the files opened, but there's also no way to infer their path in case they actually are available on disk? That would likely result in performance degradation (read+write instead of copy).

Well, you can 'abuse' the API and get the physical path, and it would take very little code to change the API to be 'physical path' based. But, as I mentioned, then you're blocking off any possibility for your cache to be non-local. It just depends on what magnitude this degradation will be in relation to the rest of the process. Any pointers to actual timings of the difference of rb->wb over shutil.copyfile? At the end of the day, I imagine shutil.copyfile still uses some form of read/write, just closer to the metal.
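
In lieu of hard numbers, a quick way to measure the difference locally (a sketch only; 'artifact.bin' is a placeholder file, and results will vary by platform and file system):

import shutil
import timeit

def copy_via_streams(src, dst, chunk=1024 * 1024):
    # read+write through Python-level file objects, as a BufferedReader API implies
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk)

print(timeit.timeit(lambda: copy_via_streams("artifact.bin", "out1.bin"), number=10))
print(timeit.timeit(lambda: shutil.copyfile("artifact.bin", "out2.bin"), number=10))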

@chrisjsewell
Member Author

Should there be a way to query outputs of a single cell?

As in, this is what you want? Because that's already in the API:

    @abstractmethod
    def get_commit_codecell(self, pk: int, index: int) -> nbf.NotebookNode:
        """Return a code cell from a committed notebook.

        NOTE: the index **only** refers to the list of code cells, e.g.
        `[codecell_0, textcell_1, codecell_2]`
        would map {0: codecell_0, 1: codecell_2}
        """
        pass

@akhmerov
Contributor

At the end of the day shutil.copyfile I imagine still uses some form of readwrite, just closer to the metal.

That strongly depends on the file system. Several modern ones (BTRFS, ZFS) do copy on write, in which case there would be no actual copying involved.

But as I mentioned, then you're blocking off any possibility for your cache to be non-local.

I realize that (as an aside, there would also be workarounds for that). However, my main point is that there is a trade-off involved, also for the cache consumer (if they only want to copy the file, they now also need to read it and write it back).

@akhmerov
Contributor

Should there be a way to query a single code cell? ...

As in this is what you want? Because thats already in the API:

I am rather wondering if it could be better to remove it for now, until there is a definite use case.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

However, my main point is that there is a trade-off involved

Oh yeh, absolutely. Perhaps a middle ground would be path access with a context manager; as in "this filepath is only guaranteed to exist within this with statement".

with cache.get_artifacts_path(pk=1) as path:
    shutil.copytree(path, "wherever")
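
A minimal sketch of how such a context manager might be implemented for a purely local cache (the folder layout is an assumption):

import contextlib
import pathlib

@contextlib.contextmanager
def get_artifacts_path(cache_dir: str, pk: int):
    """Yield the folder holding a commit record's artefacts.

    For a non-local cache this could instead download the artefacts to a
    temporary folder and delete it on exit; for local storage there is
    nothing to clean up.
    """
    yield pathlib.Path(cache_dir) / "artifacts" / str(pk)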

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

I am rather wondering if it could be better to remove it for now

Yeh, that's not a deal breaker. It mainly spawned from the earlier jupyter-sphinx proposal I made, whereby you were gathering all the code from the docutils.nodes and leaving pointers to the cells, before executing all the code, then adding the outputs to the nodes in a post-transform.

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

the users may also ask for the complete notebook from the start

The one thing that should be made clear when receiving a notebook from the cache is that the cell indices may not be the same as in the input notebook. The number and order of code cells will be the same, but there may be differing numbers of markdown cells, so that potentially nb_in.cells[3] != nb_cache.cells[3].

At the moment I've replaced all the markdown cells for cached notebooks with cell.source = "" and cell.metadata = {}. Why I didn't completely remove them was so that diffing would show the same cell indices as the source code, but I realise this can't be strictly guaranteed, so I think I'll just remove them completely.

@chrisjsewell
Member Author

The one thing that should be made clear when receiving a notebook from the cache is that the cell indices may not be the same as in the input notebook.

Actually, it would probably be good to supply a merge function that injects the updated outputs into an input notebook. Then you could use that as the 'source of truth' for everything and make it available for download etc.
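
A sketch of what such a merge function could look like (assuming, as discussed above, that the cached notebook retains only the code cells, in the same order as the input notebook's code cells):

import nbformat as nbf

def merge_outputs(nb_in: nbf.NotebookNode, nb_cache: nbf.NotebookNode) -> nbf.NotebookNode:
    """Inject outputs from a cached notebook into the matching input notebook."""
    cached_code = [c for c in nb_cache.cells if c.cell_type == "code"]
    input_code = [c for c in nb_in.cells if c.cell_type == "code"]
    assert len(cached_code) == len(input_code), "code cells must correspond"
    for target, source in zip(input_code, cached_code):
        target.outputs = source.outputs
        target.execution_count = source.execution_count
    return nb_in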

@akhmerov
Contributor

More API questions: right now staging/committing requires a URI, which means that in-memory notebooks cannot be served to the cache. Should it be file_or_stream instead? Or even a NotebookNode object?

@chrisjsewell
Member Author

chrisjsewell commented Feb 24, 2020

More API questions: right now staging/committing requires a URI

No, there's commit_notebook_bundle, which takes exactly that: a NotebookNode, wrapped in an NbBundleIn that also allows you to specify a URI and artifacts. commit_notebook_file just reads in the notebook file and then calls commit_notebook_bundle.

For staging, at present I am not actually copying anything into the cache, just keeping a reference to the URIs in the database. I wasn't sure if it was worth the extra read/write/copy to do this, or better to just give the executor the URIs of the notebooks to fetch (and possibly the URIs of assets).
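
For illustration, usage might look something like this (the import path and the exact NbBundleIn fields are assumptions based on the description above):

import nbformat as nbf
from jupyter_cache.base import NbBundleIn  # hypothetical import path

# 'cache' would be a concrete cache instance, e.g. the sqlite-backed one
nb = nbf.read("tests/notebooks/basic.ipynb", as_version=4)
bundle = NbBundleIn(nb=nb, uri="tests/notebooks/basic.ipynb", artifacts=())
cache.commit_notebook_bundle(bundle)  # commit an in-memory notebook directly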

- Commented out get_commit_codecell until use case established
- Added `commit_artefacts_temppath`
- Fully remove non-code cells from commited notebooks
@chrisjsewell
Member Author

@akhmerov see 763e38f to address some of your points

@akhmerov
Contributor

Looks good, thanks for bearing with me :)

@chrisjsewell
Member Author

Looks good, thanks for bearing with me :)

Oh no, it's great feedback 👍

@chrisjsewell
Member Author

The only main missing component now is execution assets.

I might just merge this soon and put that in an issue for later.

@chrisjsewell
Member Author

Merging now, but feel free to continue the discussion here or open separate issues.

@chrisjsewell chrisjsewell marked this pull request as ready for review February 25, 2020 06:05
@chrisjsewell chrisjsewell merged commit 1b19c75 into master Feb 25, 2020
@bsipocz bsipocz deleted the hash branch September 13, 2024 16:49