- Paper directory
- Computational provenance := how did this file get produced? What binaries, data, libraries, and other files were used? And what is the computational provenance of those files, in turn?
- System-level provenance collects this data without knowing anything about the underlying programs (black box); just looking at syscalls or the like.
- This paper is a lit review of provenance systems
- Presentation on Google Drive
- Paper directory
- Take provenance systems and benchmarks from the lit review, apply all prov systems to all benchmarks
- Reproducing: See REPRODUCING.md
- Code directory
- prov_collectors.py contains “provenance collectors”
- workloads.py contains the “workloads”; The workloads have a “setup” and a “run” phase. For example, “setup” may download stuff (we don’t want to time the setup; that would just benchmark the internet service provider), whereas “run” will do the compile (we want to time only that).
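A minimal sketch of that setup()/run() split; the class name, URL, and build commands below are placeholders, not the actual contents of workloads.py:

```python
# Sketch only: one workload with an untimed setup() and a timed run().
import subprocess
import urllib.request
from pathlib import Path

class ExampleCompile:
    """setup() downloads the source (untimed); run() compiles it (timed)."""

    url = "https://example.org/project-1.0.tar.gz"  # placeholder URL

    def setup(self, work_dir: Path) -> None:
        # Untimed: timing the download would just benchmark the ISP.
        tarball = work_dir / "project-1.0.tar.gz"
        if not tarball.exists():
            urllib.request.urlretrieve(self.url, tarball)
        subprocess.run(["tar", "xf", tarball.name], cwd=work_dir, check=True)

    def run(self, work_dir: Path) -> None:
        # Timed: only the compile itself.
        subprocess.run(["make"], cwd=work_dir / "project-1.0", check=True)
```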
- runner.py will select certain collectors and workloads; if it succeeds, the results get stored in .cache/, so subsequent executions with the same arguments return instantly (sketched below)
- experiment.py contains the logic to run experiments (especially cleaning up after them)
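A minimal sketch of runner.py's caching behavior; the actual cache key and file layout in .cache/ are assumptions:

```python
# Sketch only: results keyed by a hash of the selected collectors/workloads.
import hashlib
import pickle
from pathlib import Path

def cached_run(collectors, workloads, iterations, do_run):
    key_src = repr((sorted(collectors), sorted(workloads), iterations))
    key = hashlib.sha256(key_src.encode()).hexdigest()
    cache_file = Path(".cache") / f"results_{key}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())  # instant on repeat invocations
    results = do_run(collectors, workloads, iterations)
    cache_file.parent.mkdir(exist_ok=True)
    cache_file.write_bytes(pickle.dumps(results))     # only cached if do_run() succeeded
    return results
```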
- run_exec_wrapper.py knows how to execute commands in a “clean” environment and cgroup
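A rough sketch of what that wrapper has to do, assuming cgroup v2 with a delegated subtree; the environment whitelist and cgroup name are placeholders:

```python
# Sketch only: run a command with a stripped-down environment inside a fresh cgroup.
import os
import subprocess
from pathlib import Path

CLEAN_ENV = {"PATH": "/nonexistent", "HOME": "/homeless-shelter", "LC_ALL": "C"}  # placeholder whitelist

def run_clean(cmd: list[str], cgroup_name: str = "prov-bench") -> int:
    cgroup = Path("/sys/fs/cgroup") / cgroup_name
    cgroup.mkdir(exist_ok=True)  # requires a delegated cgroup subtree

    def enter_cgroup() -> None:
        # Runs in the child between fork() and exec(), before the command starts.
        (cgroup / "cgroup.procs").write_text(str(os.getpid()))

    return subprocess.run(cmd, env=CLEAN_ENV, preexec_fn=enter_cgroup).returncode
```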
- Stats-larger.ipynb contains the process for extracting statistics from the workload runs using Bayesian inference
- flake.nix contains the Nix expressions which describe the environment in which everything runs
- result/ directory contains the result of building flake.nix; all binaries and executables should come from result/ in order for the experiment to be reproducible
- e.g., https://dew-uff.github.io/scripts-provenance/selected.html
- e.g., https://dew-uff.github.io/scripts-provenance/graph.html
- https://joaofelipe.github.io/snowballing/start.html
- https://dl.acm.org/doi/10.1145/2601248.2601268
- Add sciunit
- Add reprozip
- Add DetTrace
- Add CDE
- Add Burrito
- Add Sumatra
- We need to get src_sh{./result/bin/python runner.py apache} to work
- I invoke src_sh{./configure --with-pcre-config=/path/to/pcre-config}, and ./configure will still complain (“no pcre-config found”).
- I ended up patching the configure script with httpd-configure.patch.
/nix/store/2z0hshv096hhavariih722pckw5v150v-apr-util-1.6.3-dev/include/apr_ldap.h:79:10: fatal error: lber.h: No such file or directory
- We need to get src_sh{./result/bin/python runner.py spack} to work
- See the docstring of SpackInstall in workloads.py.
- Spack installs a target package (call it $spec) and all of $spec’s dependencies. Then it removes $spec, while leaving the dependencies.
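A sketch of that install/uninstall trick in terms of the Spack CLI; the real, authoritative logic is in SpackInstall, and this just illustrates why run() ends up timing only $spec itself:

```python
# Sketch only: warm the dependency tree in setup(), then time installing just $spec in run().
import subprocess

def setup(spec: str) -> None:
    subprocess.run(["spack", "install", spec], check=True)          # installs $spec and all its dependencies
    subprocess.run(["spack", "uninstall", "-y", spec], check=True)  # removes only $spec, keeping the dependencies

def run(spec: str) -> None:
    subprocess.run(["spack", "install", spec], check=True)          # timed: only $spec is rebuilt
```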
- Compiling Apache is an interesting benchmark, but running Apache with a predefined request load is also an interesting benchmark.
- We should write a new class called ApacheLoad that installs Apache in its setup() (for simplicity, we won’t reuse the version we built earlier), downloads ApacheBench, and in its run() runs the server under the request load, using only tools from result/ or .work/.
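A rough sketch of what ApacheLoad could look like; the paths, port, and ab parameters are placeholders, and the real version must resolve every tool from result/ or .work/:

```python
# Sketch only: serve with the httpd installed by setup(), drive it with ApacheBench in run().
import subprocess
import time
from pathlib import Path

class ApacheLoad:
    def setup(self, work_dir: Path) -> None:
        # Untimed: install httpd and ab (ApacheBench) into work_dir, e.g. from a pinned
        # source release, and write an httpd.conf listening on a fixed port.
        ...

    def run(self, work_dir: Path) -> None:
        httpd = work_dir / "httpd" / "bin" / "httpd"
        ab = work_dir / "httpd" / "bin" / "ab"
        server = subprocess.Popen([str(httpd), "-X", "-f", str(work_dir / "httpd.conf")])
        try:
            time.sleep(2)  # crude wait for the server to come up
            # Timed: the request load itself (10k requests, 8 concurrent are placeholder numbers).
            subprocess.run([str(ab), "-n", "10000", "-c", "8", "http://localhost:8080/"], check=True)
        finally:
            server.terminate()
            server.wait()
```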
- Write a class that compiles the Linux kernel (just the kernel, no user-space software), using only tools from result/.
- The benchmark should use a specific pin of the Linux kernel and set kernel build options. Both should be customizable and set by files that are checked into Git. However, the Linux source tree should not be checked into Git (see the Apache build, where I download the source code in setup() and cache it for future use).
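A sketch of such a class, mirroring the Apache build: the version pin and build options live in checked-in files, the source tarball is downloaded and cached in setup(), and run() times only the build. File names and make targets below are placeholders:

```python
# Sketch only: kernel version and .config are checked into Git; the source tree is not.
import shutil
import subprocess
import urllib.request
from pathlib import Path

class KernelCompile:
    version_file = Path("kernel-version.txt")  # e.g. "6.6.1" (placeholder file name)
    config_file = Path("kernel-config")        # checked-in kernel build options (placeholder)

    def setup(self, work_dir: Path) -> None:
        version = self.version_file.read_text().strip()
        major = version.split(".")[0]
        tarball = work_dir / f"linux-{version}.tar.xz"
        if not tarball.exists():  # cache the download for future runs
            url = f"https://cdn.kernel.org/pub/linux/kernel/v{major}.x/linux-{version}.tar.xz"
            urllib.request.urlretrieve(url, tarball)
        subprocess.run(["tar", "xf", tarball.name], cwd=work_dir, check=True)
        shutil.copy(self.config_file, work_dir / f"linux-{version}" / ".config")

    def run(self, work_dir: Path) -> None:
        version = self.version_file.read_text().strip()
        # Timed: build just the kernel image, with tools from result/ on PATH.
        subprocess.run(["make", "-j8", "vmlinux"], cwd=work_dir / f"linux-{version}", check=True)
```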
- https://www.filesystems.org/docs/auto-pilot/Postmark.html
- See Hi-Fi, PASSv2, LPM, CamFlow for details
- pm>set transactions 400000
- It should be easy to run a large, consistent set of many different BLAST apps.
- Maybe have a 1 min, 10 min, and 60 min randomly-selected, but fixed, configuration
- https://savannah.gnu.org/hg/?group=octave
- https://hg.mozilla.org/mozilla-central/
- https://github.com/frej/fast-export
- https://wiki.mercurial-scm.org/ConvertExtension
- https://hg-git.github.io/
- https://repo.mercurial-scm.org/hg
- CleanML https://chu-data-lab.github.io/CleanML/
- Spark https://www.databricks.com/blog/2017/10/05/build-complex-data-pipelines-with-unified-analytics-platform.html
- Snakemake/nf-core workflows
- YT https://yt-project.org/doc/cookbook/index.html
- YT https://prappleizer.github.io/#tutorials
- YT https://trident.readthedocs.io/en/latest/annotated_example.html
- YT https://github.com/PyLCARS/YT_BeyondAstro
https://github.com/shellspec/shellbench
- Run Chromium and Firefox with Sunspider
- https://github.com/v8/v8/blob/04f51bc70a38fbea743588e41290bea40830a486/test/benchmarks/csuite/csuite.py#L4
https://github.com/LineRate/ssh-perf
http://www.acme.com/software/thttpd/ https://github.com/larryhe/tinyhttpd https://github.com/mendsley/tinyhttp https://cherokee-project.com/
- Determine if we need just int or also fp benchmarks
- https://www.spec.org/cpu2006/Docs/
- https://www.spec.org/sources/
- https://github.com/miyuki/spec-cpu2006-redist/
- https://www.spec.org/cpu2006/Docs/tools-build.html
- https://www.spec.org/cpu2006/Docs/install-guide-unix.html
- https://www.spec.org/cpu2006/Docs/runspec.html
- Fig 1 of https://arxiv.org/pdf/1707.05731.pdf
- https://github.com/Chicago/food-inspections-evaluation/tree/master/CODE
- Fig 7 of https://arxiv.org/pdf/1707.05731.pdf
- Fig 1 of https://doi.org/10.1016/j.envsoft.2015.12.010
- https://github.com/uva-hydroinformatics/VIC_Pre-Processing_Rules/
- https://github.com/xsdk-project/xsdk-examples
- https://github.com/LBL-EESA/alquimia-dev
https://www.nas.nasa.gov/software/npb.html
- Just runs one workload
- --setup, --main, --teardown
- Change to run_store_analyze.py
- runner.py mixes code for selecting benchmarks and prov collectors with code for summarizing statistical outputs.
- Use --benchmarks and --collectors to form a grid (see the CLI sketch after this list)
- Accept --iterations, --seed, --fail-first
- Accept --analysis $foo
- Should have an option to import external workloads and prov_collectors
- Should have --re-run, which removes .cache/results_* and .cache/$hash
- Should accept analysis hooks of type Callable[[pandas.DataFrame], None]
- setup() should do nix build and add to the path
- Should accept a tempdir
- Should be smaller
- Should have teardown
- Should fail gracefully when cgroups are not available, or even degrade to using no containers
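A sketch of the command-line surface described in the list above; flag names follow the bullets, while defaults and the --import spelling are assumptions:

```python
# Sketch only: the benchmark x collector grid CLI for the proposed run_store_analyze.py.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run prov collectors against benchmarks")
    parser.add_argument("--benchmarks", nargs="+", required=True)
    parser.add_argument("--collectors", nargs="+", required=True)
    parser.add_argument("--iterations", type=int, default=1)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--fail-first", action="store_true")
    parser.add_argument("--analysis", action="append", default=[],
                        help="Name of an analysis hook: Callable[[pandas.DataFrame], None]")
    parser.add_argument("--import", dest="imports", action="append", default=[],
                        help="Module providing external workloads / prov_collectors (assumed flag name)")
    parser.add_argument("--re-run", action="store_true",
                        help="Remove .cache/results_* and .cache/$hash before running")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    grid = [(b, c) for b in args.benchmarks for c in args.collectors]  # the full grid
    print(grid)
```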
- We need to get src_sh{./result/bin/python runner.py sciunit} to work.
- Sciunit is a Python package which depends on a binary called ptu.
- Sciunit says “sciunit: /nix/store/7x6rlzd7dqmsa474j8ilc306wlmjb8bp-python3-3.10.13-env/lib/python3.10/site-packages/sciunit2/libexec/ptu: No such file or directory”, but on my system, that file does exist! Why can’t sciunit find it?
- Answer: the file does exist; it is an ELF binary, but its interpreter is set to /lib64/linux-something.so, and that interpreter does not exist. I replaced this copy of ptu with the nix-built copy of ptu.
https://proot-me.github.io/care/
- We need to write a basic prov collector for BPF trace. The collector should log files read/written by the process and all children processes. Start by writing prov.bt.
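A possible starting point: a tiny prov.bt that logs openat and execve system-wide, wrapped in a collector class in the style of prov_collectors.py. Names, the output format, and restricting the trace to the benchmark's descendants are all still open:

```python
# Sketch only: bpftrace script embedded as a string, plus a thin start/stop wrapper.
import subprocess

PROV_BT = r"""
tracepoint:syscalls:sys_enter_openat {
    printf("%d openat %s\n", pid, str(args->filename));
}
tracepoint:syscalls:sys_enter_execve {
    printf("%d execve %s\n", pid, str(args->filename));
}
"""

class BPFTraceCollector:
    def start(self, log_path: str) -> None:
        # Needs root (or CAP_BPF); writes one line per traced syscall to log_path.
        self.proc = subprocess.Popen(["bpftrace", "-o", log_path, "-e", PROV_BT])

    def stop(self) -> None:
        self.proc.terminate()
        self.proc.wait()
```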
- We need to get src_sh{./result/bin/python runner.py spade_fuse} to work.
- src_sh{./result/bin/spade start && echo "add storage Neo4J $PWD/db" | ./result/bin/spade control}
- Currently, that fails with “Adding storage Neo4J… error: Unable to find/load class”
- The log can be found in ~/.local/share/SPADE/current.log.
- ~/.local/share/SPADE/lib/neo4j-community/lib/*.jar contains the Neo4J classes. I believe these are on the classpath. However, SPADE runs under a different Java version (or something like that), which refuses to load those jars.
- See @shiExperienceReportProducing2022. Could leverage https://pypi.org/project/reprotest/
- IO calls / CPU sec, where CPU sec is itself a random variable
- IO calls / 1M dynamic instruction
- slowdown(prov_collector) * cpu_to_wall_time(workload) * runtime(workload) ~ runtime(workload, prov_collector)
- What is the expected percent error?
- Features: count of each group of syscalls / total time
- Prog should occupy the same point as {Prog, Prog} (that is, analogous to intensive not extensive properties in physics)
- PCA and clustering and dendrogram (sketched after this list)
- https://doi.org/10.1109/IISWC.2006.302733
- How many classes and benchmarks does one need?
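A sketch of that feature-space analysis: per-workload rates (syscall-group counts divided by total time, so Prog and {Prog, Prog} land on the same point), then PCA plus hierarchical clustering to see how many distinct benchmark classes are needed. The column names and input file are placeholders:

```python
# Sketch only: PCA + dendrogram over per-workload syscall-rate features.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram

df = pd.read_csv("workload_features.csv", index_col="workload")  # hypothetical input
groups = ["file_syscalls", "net_syscalls", "proc_syscalls"]      # placeholder syscall groups
rates = df[groups].div(df["total_time_sec"], axis=0)             # intensive, not extensive

X = StandardScaler().fit_transform(rates)
coords = PCA(n_components=2).fit_transform(X)  # 2-D map of the benchmark space
Z = linkage(X, method="ward")                  # dendrogram answers "how many classes do we need?"
dendrogram(Z, labels=rates.index.to_list())
```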
- Smoosh Motivation and Background together
- Lead with the problem
- 1 problem -> provenance (vs perf overhead) -> 3 other problems solved -> 3 ways to gather
- Table of capabilities (vDSO)
- What provenance methods are most promising?
- Threats to validity
- Mathematical model
- Few of the tools are applicable to comp sci due to methods
- How many work for distributed systems
- How to handle network
- Microbenchmarks vs. applications?
- Non-negative linear regression
- Gaps in prior work re comp sci
- Stakeholder perspectives:
- Tool developers, users, facilities people
- Long-term archiving of an execution, such that it is re-executable
- Definition of I/O? I/O includes things like the username and clock_gettime
- Need Intel CPU?
- Paper directory
- Record/replay is an easier way to get reproducibility than Docker/Nix/etc.
- Use library interpositioning to make a record/replay tool that is faster than other record/replay tools
- Library constructors get called twice (2 copies of library global variables)
- https://stackoverflow.com/questions/77782964/how-to-run-code-exactly-once-in-ld-preloaded-shared-library