Cache requirements and a minimal implementation #7

Open
akhmerov opened this issue Feb 22, 2020 · 2 comments

Comments

@akhmerov
Contributor

akhmerov commented Feb 22, 2020

I would like to document my thoughts on caching, partly inspired by @chrisjsewell's exploration of different approaches. I hope these are useful.

Requirements

  1. I consider the cache limited in scope to the task of building a book/site out of a collection of input files containing code to be executed and possibly other scripts.
  2. Rebuilding the complete cache may take a few minutes, but is unlikely to take much longer.
  3. I expect that the execution will use the notebook abstraction, i.e. the input to the execution is a sequence of notebooks, with each notebook containing a kernel name and a sequence of cells to be executed.
  4. The notebooks must adhere to the following contract:
    • They should rely on assets in a controlled location (e.g. same folder as the source files).
    • Their execution results should be the same regardless of the order in which the notebooks are executed.
    • The notebooks may write additional files in a different specified location.
  5. The caching logic should not try to determine whether changes to external dependencies (scripts/installed libraries) have invalidated the outputs of a notebook, because that is too complex to implement.
  6. End users shouldn't need to learn how to operate the cache, beyond "wipe it clean".

Minimal implementation

  • Create a folder for the cache within the sphinx build directory.
  • Whenever the build process encounters a notebook, it hashes (kernel_name, code_cells), creates a subfolder named after that hash, links the notebook execution context from that folder, executes the notebook, and writes the executed notebook to that folder.
  • Invalidation happens either through sphinx clean or by deleting the cache folder.
  • When collecting the execution artifacts, sphinx copies all files from these folders.
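
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not an agreed design: `notebook_hash`, `cached_run`, `collect_artifacts`, `wipe`, the `executed.ipynb` file name, and the `execute` callback are all hypothetical names, and linking the notebook execution context (the assets) into the subfolder is left out.

```python
import hashlib
import json
import shutil
from pathlib import Path
from typing import Callable, Sequence


def notebook_hash(kernel_name: str, code_cells: Sequence[str]) -> str:
    """Hash (kernel_name, code_cells); nothing else influences the key."""
    payload = json.dumps([kernel_name, list(code_cells)]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_run(
    cache_root: Path,
    kernel_name: str,
    code_cells: Sequence[str],
    execute: Callable[[str, Sequence[str], Path], str],
) -> Path:
    """Return the cache subfolder for a notebook, executing it only on a miss.

    `execute` is a hypothetical callback: it runs the cells with the given
    folder as working directory and returns the executed notebook as text.
    """
    folder = Path(cache_root) / notebook_hash(kernel_name, code_cells)
    result = folder / "executed.ipynb"  # assumed file name
    if not result.exists():
        folder.mkdir(parents=True, exist_ok=True)
        result.write_text(execute(kernel_name, code_cells, folder), encoding="utf-8")
    return folder


def collect_artifacts(cache_root: Path, destination: Path) -> None:
    """Copy all files from every cache subfolder (requires Python 3.8+)."""
    for folder in Path(cache_root).iterdir():
        if folder.is_dir():
            shutil.copytree(folder, Path(destination) / folder.name, dirs_exist_ok=True)


def wipe(cache_root: Path) -> None:
    """The only invalidation mechanism: delete the whole cache folder."""
    shutil.rmtree(cache_root, ignore_errors=True)
```

Because the key depends only on the kernel name and the code cells, editing markdown cells would not trigger re-execution, and changes in external dependencies would go unnoticed until the cache is wiped, which matches requirement 5.
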
@chrisjsewell
Member

Hmm, I agree with ~most of these points.

> Rebuilding the complete cache may take a few minutes, but is unlikely going to be much longer.

You mean re-running all the notebooks? Well, Jupinx takes a few hours to rebuild all of theirs, so I think that's a bit optimistic.

> Create a folder for the cache within sphinx build directory

Just to clarify, the cache has nothing to do with sphinx. Sphinx may use it, but it should be able to be used independently.

@akhmerov
Contributor Author

akhmerov commented Feb 22, 2020

> Just to clarify, the cache has nothing to do with sphinx. Sphinx may use it, but it should be able to be used independently.

Indeed, keeping the cache folder within the sphinx build folder is just how I imagine sphinx could use the cache.

> You mean re-running all the notebooks? Well Jupinx take a few hours to rebuild all theirs, so I think that's a bit optimistic.

Fair enough. I have a course that takes about an hour to build sequentially, indeed.


Additions/observations based on the above:

  • Isolating the outputs of each notebook into a folder is currently missing from "Proposal for git based cache" #6. Without that, we cannot tell whether a non-notebook artifact should be used or not.
  • I wonder whether these expectations about what exactly a notebook produces leave the cache useful beyond book building. They seem rather specific.
  • I agree that the option of manual invalidation seems useful for very long courses. How about making the cache location configurable? Computationally cheap projects could store everything in the sphinx build folder and benefit from easier cleanup, while the more expensive ones would need more fine-grained cache control and keep the cache outside. At the same time, if there's a good CLI external to sphinx that doesn't advertise fine-grained cache manipulation too much, an external cache folder doesn't hurt either.
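
A configurable location could be as simple as the sketch below. Everything here is an assumption for illustration: the function `resolve_cache_dir`, its `configured` argument, and the default folder name `.notebook_cache` are hypothetical and not part of sphinx or any existing tool.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_cache_dir(
    build_dir: Union[str, Path],
    configured: Optional[str] = None,
) -> Path:
    """Choose where the cache lives.

    With no explicit setting the cache sits inside the build folder, so
    cleaning the build also wipes the cache. An explicit path (hypothetical
    option) keeps the cache elsewhere, where it survives a clean build.
    """
    if configured is None:
        return Path(build_dir) / ".notebook_cache"  # assumed default name
    return Path(configured).expanduser()
```

A cheap project would call `resolve_cache_dir("_build")`, while an expensive course would pass an explicit path such as `resolve_cache_dir("_build", "~/course-cache")` and manage invalidation itself.
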
