Update docs for persistent prov (#80)
charmoniumQ authored Dec 11, 2024
1 parent 194570c commit f6400b8
Showing 1 changed file with 27 additions and 0 deletions.
27 changes: 27 additions & 0 deletions docs/persistent_prov.md
`persistent_provenance.py` implements persistent (between-process) provenance.

`probe record ...` efficiently tracks provenance within a single process, writing the result to `probe_log`.
If the process that reads a file is not the same as the process that writes it, but the two share a common ancestor, this still works well.
For example, suppose a compiler reads `main.c` and writes `main.o`, and a linker reads `main.o` and writes `main.a`.
- PROBEing just the compiler will not capture the full usage of the `.c` files;
- PROBEing just the linker will not capture the full source of the `.a` files;
- However, PROBEing the `make` which invokes both (`make` is a common ancestor) captures the sources and uses of all the files involved.

However, there are cases where there is no common ancestor process:
- The computation could be multi-node (a process on machine A and a process on machine B have no common ancestor).
- The computation could span machine restarts (a process writes a file, the machine restarts, and another process reads the file).

Therefore, we will write the dataflow DAG to disk in [XDG data home](https://wiki.archlinux.org/title/XDG_Base_Directory) at transcription time.
The result is a gigantic dataflow DAG that can span multiple invocations of PROBE, multiple boots, and perhaps even operations on multiple hosts.
If we ran `gcc` on remote host `X` and `scp`ed the result back, those operations could all appear as nodes in the DAG.
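
As a rough sketch of what "writing the DAG to XDG data home" could look like (the `PROBE` subdirectory and the edge-file layout here are assumptions for illustration, not the actual on-disk format):

```python
import json
import os
from pathlib import Path


def probe_data_home() -> Path:
    """Resolve XDG data home, falling back to the spec's default of ~/.local/share."""
    base = os.environ.get("XDG_DATA_HOME") or str(Path.home() / ".local" / "share")
    path = Path(base) / "PROBE"  # hypothetical subdirectory name
    path.mkdir(parents=True, exist_ok=True)
    return path


def append_dataflow_edge(writer: str, reader: str) -> None:
    """Record one dataflow edge (writer node -> reader node) at transcription time."""
    edges = probe_data_home() / "dataflow_edges.jsonl"  # hypothetical file name
    with edges.open("a") as f:
        f.write(json.dumps({"writer": writer, "reader": reader}) + "\n")
```

Appending at transcription time means each transcription only ever adds nodes and edges, so the persistent DAG grows monotonically across processes and boots.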

Common queries:
- Upward (direction of dataflow) queries:
- What outputs were dependent on this input? (aka push-based updating). If a user overwrites a particular data file, they may want to regenerate every currently extant output which depended on that data file.
- Downward (opposite of dataflow) queries:
  - What inputs were used to make this output? (aka pull-based updating). This query is used in applications like "Make-without-Makefile".
  - When the user does an `scp` or `rsync`, extract the "relevant" bits of provenance and send them along, so a user at the destination machine (which could be local or remote) can query the provenance of the files we are sending.

We need to be able to query the graph in both directions.
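
Conceptually, both directions are the same breadth-first traversal over the same edge set, just with the edges flipped. A minimal in-memory sketch (node names taken from the compiler/linker example above):

```python
from collections import defaultdict, deque


def reachable(edges: list[tuple[str, str]], start: str, upward: bool) -> set[str]:
    """BFS over the dataflow DAG.

    upward=True follows dataflow (what outputs depend on `start`, push-based);
    upward=False follows edges in reverse (what inputs made `start`, pull-based).
    """
    adjacent = defaultdict(list)
    for writer, reader in edges:
        if upward:
            adjacent[writer].append(reader)
        else:
            adjacent[reader].append(writer)
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        for nxt in adjacent[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


edges = [("main.c", "main.o"), ("main.o", "main.a")]
print(reachable(edges, "main.c", upward=True))   # {'main.o', 'main.a'}
print(reachable(edges, "main.a", upward=False))  # {'main.o', 'main.c'}
```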

While a graph database would be more efficient, SQLite is very battle-tested and does not require a daemon process.
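
A minimal sketch of the SQLite approach (the `dataflow` table and column names are assumptions, not the actual schema): indexing both columns keeps both query directions cheap, and SQLite's recursive CTEs handle the transitive closure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # on disk, this file would live under XDG data home
conn.executescript("""
    CREATE TABLE dataflow (
        writer TEXT NOT NULL,  -- node that produced the data
        reader TEXT NOT NULL   -- node that consumed it
    );
    CREATE INDEX dataflow_by_writer ON dataflow (writer);  -- fast upward queries
    CREATE INDEX dataflow_by_reader ON dataflow (reader);  -- fast downward queries
""")
conn.executemany(
    "INSERT INTO dataflow VALUES (?, ?)",
    [("main.c", "main.o"), ("main.o", "main.a")],
)

# Downward query: transitively find every input that went into main.a.
rows = conn.execute("""
    WITH RECURSIVE inputs(node) AS (
        SELECT writer FROM dataflow WHERE reader = ?
        UNION
        SELECT d.writer FROM dataflow d JOIN inputs i ON d.reader = i.node
    )
    SELECT node FROM inputs
""", ("main.a",)).fetchall()
print(rows)  # [('main.o',), ('main.c',)]
```

The upward query is the mirror image: swap `writer` and `reader` in the CTE. Because each direction hits its own index, neither requires a full table scan.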
