Update docs for persistent prov (#80)
1 parent 194570c, commit f6400b8
Showing 1 changed file with 27 additions and 0 deletions.
@@ -0,0 +1,27 @@
`persistent_provenance.py` implements persistent (between-process) provenance.

`probe record ...` efficiently tracks provenance within a single invocation (a process and its descendants), writing the result to `probe_log`.
This works well even when the process that reads a file is not the process that wrote it, as long as the two have a common ancestor that PROBE is tracing.
For example, suppose a compiler reads `main.c` and writes `main.o`, and a linker reads `main.o` and writes `main.a`.
- PROBEing just the compiler will not capture the full usage of the `.c` files;
- PROBEing just the linker will not capture the full source of the `.a` files;
- However, PROBEing the `make` that invokes both (`make` is a common ancestor) captures the sources and uses of all the files involved.

However, there are cases where there is no common ancestor process:
- The computation could be multi-node (a process on machine A and a process on machine B have no common ancestor).
- The computation could be carried out across restarts (a process writes a file, the machine restarts, and a process reads the file).

Therefore, we will write the dataflow DAG to disk in [XDG data home](https://wiki.archlinux.org/title/XDG_Base_Directory) at transcription-time.
The result is a gigantic dataflow DAG that can span multiple invocations of PROBE, multiple boots, and perhaps even operations on multiple hosts.
If we ran `gcc` on remote `X` and `scp`ed the result back, those could all appear as nodes in the DAG.
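
As a rough illustration of where such an on-disk DAG could live, the sketch below resolves the XDG data home following the XDG Base Directory rules (use `$XDG_DATA_HOME` if it is set to an absolute path, otherwise fall back to `~/.local/share`). The function names, the `PROBE` subdirectory, and the `provenance.db` filename are assumptions made for this example, not names taken from `persistent_provenance.py`.

```python
import os
import pathlib


def xdg_data_home() -> pathlib.Path:
    """Return $XDG_DATA_HOME, falling back to ~/.local/share."""
    value = os.environ.get("XDG_DATA_HOME", "")
    # The XDG spec says unset, empty, or relative values should be ignored.
    if value and os.path.isabs(value):
        return pathlib.Path(value)
    return pathlib.Path.home() / ".local" / "share"


def provenance_db_path(app_name: str = "PROBE") -> pathlib.Path:
    """Return a per-user database path, creating its directory if needed.

    Both the default app_name and the filename are hypothetical.
    """
    directory = xdg_data_home() / app_name
    directory.mkdir(parents=True, exist_ok=True)
    return directory / "provenance.db"
```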

Common queries:
- Upward (direction of dataflow) queries:
  - What outputs depended on this input? (aka push-based updating). If a user overwrites a particular data file, they may want to regenerate every extant output that depended on that data file.
- Downward (opposite of dataflow) queries:
  - What inputs were used to make this output? (aka pull-based updating). This query is used in applications like "Make-without-Makefile".
  - When the user does an SCP or Rsync, extract the "relevant" bits of provenance and send them to the remote, so a user at the destination machine (the destination could be local or remote) can query the provenance of the files we are sending.

We need to be able to query the graph in both directions.

While a graph database would be more efficient, SQLite is very battle-tested and does not require a daemon process.
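
To make the two query directions concrete, here is a minimal sketch of how a dataflow DAG might be stored as an edge table in SQLite and traversed with recursive common table expressions. The schema, the table and column names, and the helper functions are all assumptions for illustration; they are not the schema used by `persistent_provenance.py`. Nodes are plain strings here, standing in for whatever identifies a file version or process.

```python
import sqlite3

# Hypothetical schema: one row per dataflow edge (src flows into dst).
SCHEMA = """
CREATE TABLE IF NOT EXISTS edges (
    src TEXT NOT NULL,
    dst TEXT NOT NULL,
    PRIMARY KEY (src, dst)
);
CREATE INDEX IF NOT EXISTS edges_by_dst ON edges (dst);
"""

# Upward query: follow edges in the direction of dataflow.
# UNION (rather than UNION ALL) drops duplicates, so the recursion terminates.
UPWARD = """
WITH RECURSIVE reach(node) AS (
    SELECT dst FROM edges WHERE src = :start
    UNION
    SELECT edges.dst FROM edges JOIN reach ON edges.src = reach.node
)
SELECT node FROM reach ORDER BY node
"""

# Downward query: follow edges against the direction of dataflow.
DOWNWARD = """
WITH RECURSIVE reach(node) AS (
    SELECT src FROM edges WHERE dst = :start
    UNION
    SELECT edges.src FROM edges JOIN reach ON edges.dst = reach.node
)
SELECT node FROM reach ORDER BY node
"""


def open_graph(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the edge database; no daemon process is involved."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_edge(conn: sqlite3.Connection, src: str, dst: str) -> None:
    """Record that data flowed from src to dst; repeated edges are ignored."""
    conn.execute("INSERT OR IGNORE INTO edges (src, dst) VALUES (?, ?)", (src, dst))


def outputs_depending_on(conn: sqlite3.Connection, node: str) -> list[str]:
    """Upward: everything transitively derived from node."""
    return [row[0] for row in conn.execute(UPWARD, {"start": node})]


def inputs_used_by(conn: sqlite3.Connection, node: str) -> list[str]:
    """Downward: everything node was transitively derived from."""
    return [row[0] for row in conn.execute(DOWNWARD, {"start": node})]


if __name__ == "__main__":
    graph = open_graph()
    # The compiler/linker example from above, with processes as nodes too.
    for src, dst in [("main.c", "gcc"), ("gcc", "main.o"),
                     ("main.o", "ld"), ("ld", "main.a")]:
        add_edge(graph, src, dst)
    print(outputs_depending_on(graph, "main.c"))
    print(inputs_used_by(graph, "main.a"))
```

The primary key on `(src, dst)` serves lookups by `src`, and the extra index on `dst` serves the reverse direction, which is one way to keep both kinds of query cheap in a relational store.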