From f522bf4c0c3f51b4e13709078493c0c633fb4b04 Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 08:09:56 +0000
Subject: [PATCH 1/9] Add tracing tools and documentation

These tools utilize the tracepoints added in #883

Co-authored-by: Claire Huang <claire.x.huang@gmail.com>
---
 tools/README.md                            | 193 +++++++++++++++++++++
 tools/alloc_slow.bt                        |  11 ++
 tools/epilogue.bt.fragment                 |   7 +
 tools/gc_stages.bt                         |  28 +++
 tools/lock_contended.bt                    |  11 ++
 tools/packet_size.bt                       |   8 +
 tools/prologue_with_harness.bt.fragment    |  10 ++
 tools/prologue_without_harness.bt.fragment |   4 +
 tools/run.py                               |  57 ++++++
 9 files changed, 329 insertions(+)
 create mode 100644 tools/README.md
 create mode 100644 tools/alloc_slow.bt
 create mode 100644 tools/epilogue.bt.fragment
 create mode 100644 tools/gc_stages.bt
 create mode 100644 tools/lock_contended.bt
 create mode 100644 tools/packet_size.bt
 create mode 100644 tools/prologue_with_harness.bt.fragment
 create mode 100644 tools/prologue_without_harness.bt.fragment
 create mode 100755 tools/run.py

diff --git a/tools/README.md b/tools/README.md
new file mode 100644
index 0000000000..1c28505569
--- /dev/null
+++ b/tools/README.md
@@ -0,0 +1,193 @@
+# MMTk performace tracing
+
+## Notes for MMTk developers
+Please open pull requests if you develop new tools that others might find useful. When you add new tools, please update this documentation.
+If you change MMTk internals that the tracing tools depend on (such as the definition of `enum WorkBucketStage`), please update the scripts accordingly.
+
+## Notes for MMTk users
+Since some of the tools depend on the MMTk internals, please use the tools shipped with the MMTk release you use.
+
+## Tracepoints
+Currently, the core provides the following tracepoints.
+- `mmtk:collection_initialized()`: GC is enabled
+- `mmtk:harness_begin()`: the timing iteration of a benchmark begins
+- `mmtk:harness_end()`: the timing iteration of a benchmark ends
+- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop
+- `mmtk:gcworker_run()`: a GC worker thread enters its work loop
+- `mmtk:gc_start()`: a collection epoch starts
+- `mmtk:gc_end()`: a collection epoch ends
+- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the `process_edges` method. The first argument is the number of edges to be processed, and the second argument is whether these edges are root edges.
+- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the numerical representation of `enum WorkBucketStage`.
+- `mmtk:work_poll()`: a work packet is to be polled.
+- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just executed. The first argument is points to the string of the Rust type name of the work packet, and the second argument is the length of the string.
+- `mmtk:alloc_slow_once_start()`: the allocation slow path starts.
+- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
+
+## Running tracing tools
+The tracing tools are to be invoked by a wrapper script `run.py`.
+```
+usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool
+
+positional arguments:
+  tool                  Name of the bpftrace tool
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -b BPFTRACE, --bpftrace BPFTRACE
+                        Path of the bpftrace executable
+  -m MMTK, --mmtk MMTK  Path of the MMTk binary
+  -H, --harness         Only collect data for the timing iteration (harness_begin/harness_end)
+  -p, --print-script    Print the content of the bpftrace script
+  -f {text,json}, --format {text,json}
+                        bpftrace output format
+```
+
+- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace` executable in your `PATH`. We strongly recommend you use the latest statically complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). You need to be able to have sudo permission for whichever `bpftrace` you want to use.
+- `-m`: the path to a MMTk binary that contains the tracepoints. This depends on the binding you use. For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under your build folder. To check whether the binary contains tracepoints, you can use `readelf -n`. You should see a bunch of `stapsdt` notes with `mmtk` as the provider.
+- `-H`: pass this flag is you want to only measure the timing iteration of a benchmark. By default, the tracing tools will measure the entire execution.
+- `-p`: print the entire tracing script before execution. This is mainly for debugging use.
+- `-f`: change the bpftrace output format. By default, it uses human-readable plain text output (`text`). You can set this to `json` for easy parsing.
+
+Please run the tracing tools **before** running the workload. If you use `-H`, the tracing tools will automatically end with `harness_end` is called. Otherwise, you will need to terminate the tools manually with `Ctrl-C`. These tools also have a timeout of 1200 seconds so not to stall unattended benchmark execution. 
+
+## Tracing tools
+### Measuring the time spend in allocation slow path (`alloc_slow`)
+This tool measures the distribution of the allocation slow path time.
+The time unit is ns / 400, so that we use the histogram bins with higher fidelity better.
+
+Sample output:
+```
+@alloc_slow_hist: 
+[4, 8)               304 |@                                                   |
+[8, 16)            12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
+[16, 32)            8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
+[32, 64)             941 |@@@                                                 |
+[64, 128)            171 |                                                    |
+[128, 256)            13 |                                                    |
+[256, 512)             2 |                                                    |
+[512, 1K)              0 |                                                    |
+[1K, 2K)               0 |                                                    |
+[2K, 4K)               0 |                                                    |
+[4K, 8K)               0 |                                                    |
+[8K, 16K)              0 |                                                    |
+[16K, 32K)            14 |                                                    |
+[32K, 64K)            37 |                                                    |
+[64K, 128K)           19 |                                                    |
+[128K, 256K)           1 |                                                    |
+```
+
+In the above output, we can see that most allocation slow paths finish between 3.2us and 6.4us. However, there is a long tail, presumably due to GC pauses.
+
+### Measuring the time spend in different GC stages (`gc_stages`)
+This tool measures the time spent in different stages of GC: before `Closure`, during `Closure`, and after `Closure`.
+The time unit is ns.
+
+Sample output:
+```
+@closure_time: 1405302743
+@post_closure_time: 81432919
+@pre_closure_time: 103886118
+```
+
+In the above output, overall, the execution spends 1.4s in the main transitive closure, 103ms before that, and 81ms after that (a total of around 1.5s).
+
+### Measuring the time spend in lock contended state for Rust `Mutex` (`lock_contend`)
+This tools measures the time spent in the lock contended state for Rust `Mutex`s. The Rust standard library implements `Mutex` using the fast-slow-path paradigm. Most lock operations take place in inlined fast paths, when there's no contention. However, when there's contention, `std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
+
+```rust
+#[inline]
+pub fn lock(&self) {
+    if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() {
+        self.lock_contended();
+    }
+}
+
+#[cold]
+fn lock_contended(&self) {
+    <snip>
+}
+```
+
+
+MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization, and this tool can be useful to measure the contention in these parts of code.
+
+The time unit is ns / 256.
+
+Sample output:
+```
+@lock_dist[140637228007056]: 
+[1]                  447 |@@@@                                                |
+[2, 4)              3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
+[4, 8)              3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
+[8, 16)             1354 |@@@@@@@@@@@@@@                                      |
+[16, 32)             832 |@@@@@@@@                                            |
+[32, 64)            1077 |@@@@@@@@@@@                                         |
+[64, 128)           2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
+[128, 256)          4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
+[256, 512)          5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
+[512, 1K)           1203 |@@@@@@@@@@@@                                        |
+[1K, 2K)              34 |                                                    |
+[2K, 4K)              15 |                                                    |
+```
+
+In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) roughly has a bimodal distribution in terms of the time spent in lock contended code path. The first peak is around 512ns~1024ns, and the second peak is around 66us~131us.
+If you can't tell which lock instance is for which lock in MMTk, you can trace the allocation of the Mutex and record the stack trace (note that you might want to compile MMTk with `force-frame-pointers` to obtain better stack traces).
+
+### Measuring the distribution of `process_edges` packet sizes (`packet_size`)
+Most of the GC time is spend in the transitive closure for tracing-based GCs, and MMTk performs transitive closure via work packets that calls the `process_edges` method.
+This tool measures the distribution of the sizes of these work packets, and also count root edges separately.
+
+Sample output:
+```
+@process_edges_packet_size: 
+[1]                  238 |@@@@@                                               |
+[2, 4)               806 |@@@@@@@@@@@@@@@@@                                   |
+[4, 8)              1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
+[8, 16)             1105 |@@@@@@@@@@@@@@@@@@@@@@@                             |
+[16, 32)            2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
+[32, 64)            1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
+[64, 128)           1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
+[128, 256)          1131 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
+[256, 512)          2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
+[512, 1K)           1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
+[1K, 2K)            1028 |@@@@@@@@@@@@@@@@@@@@@@                              |
+[2K, 4K)             874 |@@@@@@@@@@@@@@@@@@                                  |
+[4K, 8K)            1024 |@@@@@@@@@@@@@@@@@@@@@@                              |
+[8K, 16K)             58 |@                                                   |
+[16K, 32K)             5 |                                                    |
+
+@process_edges_root_packet_size: 
+[1]                   71 |@@@@@@@                                             |
+[2, 4)                 4 |                                                    |
+[4, 8)               276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
+[8, 16)              495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
+[16, 32)             477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
+[32, 64)             344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
+[64, 128)            242 |@@@@@@@@@@@@@@@@@@@@@@@@@                           |
+[128, 256)           109 |@@@@@@@@@@@                                         |
+[256, 512)            31 |@@@                                                 |
+[512, 1K)             33 |@@@                                                 |
+[1K, 2K)              75 |@@@@@@@                                             |
+[2K, 4K)              75 |@@@@@@@                                             |
+[4K, 8K)             336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
+[8K, 16K)             56 |@@@@@                                               |
+[16K, 32K)             3 |                                                    |
+```
+
+In the above output, we can see that overall, the sizes of the `process_edges` has a unimodal distribution with a peak around 16~32 edges per packet. However, if we focus on root edges, the distribution is roughly bimodal, with a first peak around 8~16 and a second peak around 4096~8192.
+
+## Attribution
+If used for research, please cite the following publication (the `BibTeX` record will be updated once a DOI is assigned).
+```bibtex
+@inproceedings{conf/mplr/Huang23,
+  author       = {Claire Huang and
+                  Stephen M. Blackburn and
+                  Zixian Cai},
+  title        = {Improving Garbage Collection Observability with Performance Tracing},
+  booktitle    = {Proceedings of the 20th International Conference on Managed Programming
+                  Languages and Runtimes, {MPLR} 2023, Cascais, Portugal, October
+                  22, 2023},
+  publisher    = {{ACM}},
+  year         = {2023}
+}
+```
diff --git a/tools/alloc_slow.bt b/tools/alloc_slow.bt
new file mode 100644
index 0000000000..202eea1e6b
--- /dev/null
+++ b/tools/alloc_slow.bt
@@ -0,0 +1,11 @@
+usdt:$MMTK:mmtk:alloc_slow_once_start {
+    if (@stats_enabled) {
+        @alloc_slow_nsecs[tid] = nsecs;
+    }
+}
+
+usdt:$MMTK:mmtk:alloc_slow_once_end {
+    if (@stats_enabled) {
+        @alloc_slow_hist = hist((nsecs - @alloc_slow_nsecs[tid])/400);
+    }
+}
diff --git a/tools/epilogue.bt.fragment b/tools/epilogue.bt.fragment
new file mode 100644
index 0000000000..70b82a4eb7
--- /dev/null
+++ b/tools/epilogue.bt.fragment
@@ -0,0 +1,7 @@
+END {
+    system("rm $TMP_FILE");
+}
+
+interval:s:1200 {
+    exit();
+}
diff --git a/tools/gc_stages.bt b/tools/gc_stages.bt
new file mode 100644
index 0000000000..1dab57d99e
--- /dev/null
+++ b/tools/gc_stages.bt
@@ -0,0 +1,28 @@
+usdt:$MMTK:mmtk:gc_start {
+    if (@stats_enabled) {
+        @gc_start_nsecs = nsecs;
+    }
+}
+
+usdt:$MMTK:mmtk:gc_end {
+    if (@stats_enabled) {
+        @post_closure_time += nsecs - @post_closure_nsecs;
+    }
+}
+
+usdt:$MMTK:mmtk:bucket_opened {
+    if (@stats_enabled) {
+        $ns = nsecs;
+        // Please check enum WorkBucketStage for the numerical values of stages
+        // Closure is 2 when vo_bit is not set
+        if (arg0 == 2) {
+            @closure_nsecs = $ns;
+            @pre_closure_time += $ns - @gc_start_nsecs;
+        }
+        // Release is 14 when vo_bit is not set
+        if (arg0 == 14) {
+            @post_closure_nsecs = $ns;
+            @closure_time += $ns - @closure_nsecs;
+        }
+    }
+}
diff --git a/tools/lock_contended.bt b/tools/lock_contended.bt
new file mode 100644
index 0000000000..044ed61b58
--- /dev/null
+++ b/tools/lock_contended.bt
@@ -0,0 +1,11 @@
+uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
+    if (@stats_enabled) {
+        @lock_nsecs[tid] = (arg0, nsecs);
+    }
+}
+
+uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
+    if (@stats_enabled) {
+        @lock_dist[@lock_nsecs[tid].0] = hist((nsecs - @lock_nsecs[tid].1)/256);
+    }
+}
diff --git a/tools/packet_size.bt b/tools/packet_size.bt
new file mode 100644
index 0000000000..c50cd98af7
--- /dev/null
+++ b/tools/packet_size.bt
@@ -0,0 +1,8 @@
+usdt:$MMTK:mmtk:process_edges {
+    if (@stats_enabled) {
+        @process_edges_packet_size = hist(arg0);
+        if (arg1) {
+            @process_edges_root_packet_size = hist(arg0);
+        }
+    }
+}
diff --git a/tools/prologue_with_harness.bt.fragment b/tools/prologue_with_harness.bt.fragment
new file mode 100644
index 0000000000..81f9156a39
--- /dev/null
+++ b/tools/prologue_with_harness.bt.fragment
@@ -0,0 +1,10 @@
+usdt:$MMTK:mmtk:harness_begin {
+    //begin collecting data at harness_begin (start of final iteration)
+    @stats_enabled = 1;
+}
+
+usdt:$MMTK:mmtk:harness_end {
+    //end data at harness_end (end of final iteration)
+    @stats_enabled = 0;
+    exit();
+}
diff --git a/tools/prologue_without_harness.bt.fragment b/tools/prologue_without_harness.bt.fragment
new file mode 100644
index 0000000000..3ab4b60f4c
--- /dev/null
+++ b/tools/prologue_without_harness.bt.fragment
@@ -0,0 +1,4 @@
+BEGIN {
+    //always collect data
+    @stats_enabled = 1;
+}
diff --git a/tools/run.py b/tools/run.py
new file mode 100755
index 0000000000..ab56d2de45
--- /dev/null
+++ b/tools/run.py
@@ -0,0 +1,57 @@
+#!/usr/bin/env python3
+from string import Template
+from argparse import ArgumentParser
+from pathlib import Path
+import tempfile
+import sys
+import os
+
+
+def get_args():
+    parser = ArgumentParser()
+    parser.add_argument("-b", "--bpftrace", type=str, default="bpftrace",
+                        help="Path of the bpftrace executable")
+    parser.add_argument("-m", "--mmtk", type=str, required=True,
+                        help="Path of the MMTk binary")
+    parser.add_argument("-H", "--harness", action="store_true",
+                        help="Only collect data for the timing iteration (harness_begin/harness_end)")
+    parser.add_argument("-p", "--print-script", action="store_true",
+                        help="Print the content of the bpftrace script")
+    parser.add_argument(
+        "-f", "--format", choices=["text", "json"], default="text", help="bpftrace output format")
+    parser.add_argument("tool", type=str, help="Name of the bpftrace tool")
+    return parser.parse_args()
+
+
+def main():
+    args = get_args()
+    here = Path(__file__).parent.resolve()
+    bpftrace_script = here / f"{args.tool}.bt"
+    if not bpftrace_script.exists():
+        print(f"Tracing script {str(bpftrace_script)} not found.")
+        sys.exit(1)
+    mmtk_bin = Path(args.mmtk)
+    if not mmtk_bin.exists():
+        print(f"MMTk binary {str(mmtk_bin)} not found.")
+        sys.exit(1)
+    prologue_file = here / \
+        "prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment"
+    with prologue_file.open() as fd:
+        prologue = fd.read()
+    with (here / "epilogue.bt.fragment").open() as fd:
+        epilogue = fd.read()
+    with bpftrace_script.open() as fd:
+        template = Template(prologue + fd.read() + epilogue)
+        with tempfile.NamedTemporaryFile(mode="w+t") as tmp:
+            content = template.safe_substitute(
+                MMTK=mmtk_bin, TMP_FILE=tmp.name)
+            if args.print_script:
+                print(content)
+            tmp.write(content)
+            tmp.flush()
+            os.execvp("sudo", ["sudo", args.bpftrace,
+                      "--unsafe", "-f", args.format, tmp.name])
+
+
+if __name__ == "__main__":
+    main()

From d8814ef9e1e03b5b974184c5b4d4fa4af01b1702 Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 08:12:07 +0000
Subject: [PATCH 2/9] Escape tildes

---
 tools/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/README.md b/tools/README.md
index 1c28505569..876bbd224c 100644
--- a/tools/README.md
+++ b/tools/README.md
@@ -130,7 +130,7 @@ Sample output:
 [2K, 4K)              15 |                                                    |
 ```
 
-In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) roughly has a bimodal distribution in terms of the time spent in lock contended code path. The first peak is around 512ns~1024ns, and the second peak is around 66us~131us.
+In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) roughly has a bimodal distribution in terms of the time spent in lock contended code path. The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us.
 If you can't tell which lock instance is for which lock in MMTk, you can trace the allocation of the Mutex and record the stack trace (note that you might want to compile MMTk with `force-frame-pointers` to obtain better stack traces).
 
 ### Measuring the distribution of `process_edges` packet sizes (`packet_size`)
@@ -174,7 +174,7 @@ Sample output:
 [16K, 32K)             3 |                                                    |
 ```
 
-In the above output, we can see that overall, the sizes of the `process_edges` has a unimodal distribution with a peak around 16~32 edges per packet. However, if we focus on root edges, the distribution is roughly bimodal, with a first peak around 8~16 and a second peak around 4096~8192.
+In the above output, we can see that overall, the sizes of the `process_edges` has a unimodal distribution with a peak around 16\~32 edges per packet. However, if we focus on root edges, the distribution is roughly bimodal, with a first peak around 8\~16 and a second peak around 4096\~8192.
 
 ## Attribution
 If used for research, please cite the following publication (the `BibTeX` record will be updated once a DOI is assigned).

From 93c5b490f5888c07e4e61cca46ed758459a40f54 Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 10:13:23 +0000
Subject: [PATCH 3/9] Address review comments

---
 tools/run.py | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/tools/run.py b/tools/run.py
index ab56d2de45..9bdccf1632 100755
--- a/tools/run.py
+++ b/tools/run.py
@@ -36,21 +36,27 @@ def main():
         sys.exit(1)
     prologue_file = here / \
         "prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment"
-    with prologue_file.open() as fd:
-        prologue = fd.read()
-    with (here / "epilogue.bt.fragment").open() as fd:
-        epilogue = fd.read()
-    with bpftrace_script.open() as fd:
-        template = Template(prologue + fd.read() + epilogue)
-        with tempfile.NamedTemporaryFile(mode="w+t") as tmp:
-            content = template.safe_substitute(
-                MMTK=mmtk_bin, TMP_FILE=tmp.name)
-            if args.print_script:
-                print(content)
-            tmp.write(content)
-            tmp.flush()
-            os.execvp("sudo", ["sudo", args.bpftrace,
-                      "--unsafe", "-f", args.format, tmp.name])
+    prologue = prologue_file.read_text()
+    epilogue = (here / "epilogue.bt.fragment").read_text()
+    template = Template(prologue + bpftrace_script.read_text() + epilogue)
+    with tempfile.NamedTemporaryFile(mode="w+t") as tmp:
+        content = template.safe_substitute(
+            MMTK=mmtk_bin, TMP_FILE=tmp.name)
+        if args.print_script:
+            print(content)
+        tmp.write(content)
+        tmp.flush()
+        # We use execvp to replace the current process instead of creating
+        # a subprocess (or sh -c). This is so that when users invoke this from
+        # the command line, Ctrl-C will be captured by bpftrace instead of the
+        # outer Python script. The temporary file can then be cleaned up by
+        # the END probe in bpftrace.
+        # 
+        # In theory, you can implement this via pty, but it is very finicky
+        # and doesn't work reliably.
+        # See also https://github.com/anupli/running-ng/commit/b74e3a13f56dd97f73432d8a391e1d6cd9db8663
+        os.execvp("sudo", ["sudo", args.bpftrace,
+                           "--unsafe", "-f", args.format, tmp.name])
 
 
 if __name__ == "__main__":

From 8fd9b9c2111062f548965f0407309ac400a8e375 Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 10:18:48 +0000
Subject: [PATCH 4/9] Reflow the README

---
 tools/README.md | 98 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 73 insertions(+), 25 deletions(-)

diff --git a/tools/README.md b/tools/README.md
index 876bbd224c..ae6bd86b43 100644
--- a/tools/README.md
+++ b/tools/README.md
@@ -1,11 +1,14 @@
 # MMTk performace tracing
 
 ## Notes for MMTk developers
-Please open pull requests if you develop new tools that others might find useful. When you add new tools, please update this documentation.
-If you change MMTk internals that the tracing tools depend on (such as the definition of `enum WorkBucketStage`), please update the scripts accordingly.
+Please open pull requests if you develop new tools that others might find useful.
+When you add new tools, please update this documentation.
+If you change MMTk internals that the tracing tools depend on (such as the
+definition of `enum WorkBucketStage`), please update the scripts accordingly.
 
 ## Notes for MMTk users
-Since some of the tools depend on the MMTk internals, please use the tools shipped with the MMTk release you use.
+Since some of the tools depend on the MMTk internals, please use the tools
+shipped with the MMTk release you use.
 
 ## Tracepoints
 Currently, the core provides the following tracepoints.
@@ -16,10 +19,15 @@ Currently, the core provides the following tracepoints.
 - `mmtk:gcworker_run()`: a GC worker thread enters its work loop
 - `mmtk:gc_start()`: a collection epoch starts
 - `mmtk:gc_end()`: a collection epoch ends
-- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the `process_edges` method. The first argument is the number of edges to be processed, and the second argument is whether these edges are root edges.
-- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the numerical representation of `enum WorkBucketStage`.
+- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the
+`process_edges` method. The first argument is the number of edges to be processed,
+and the second argument is whether these edges are root edges.
+- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the
+numerical representation of `enum WorkBucketStage`.
 - `mmtk:work_poll()`: a work packet is to be polled.
-- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just executed. The first argument is points to the string of the Rust type name of the work packet, and the second argument is the length of the string.
+- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just
+executed. The first argument is points to the string of the Rust type name of
+the work packet, and the second argument is the length of the string.
 - `mmtk:alloc_slow_once_start()`: the allocation slow path starts.
 - `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
 
@@ -42,18 +50,37 @@ optional arguments:
                         bpftrace output format
 ```
 
-- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace` executable in your `PATH`. We strongly recommend you use the latest statically complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). You need to be able to have sudo permission for whichever `bpftrace` you want to use.
-- `-m`: the path to a MMTk binary that contains the tracepoints. This depends on the binding you use. For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under your build folder. To check whether the binary contains tracepoints, you can use `readelf -n`. You should see a bunch of `stapsdt` notes with `mmtk` as the provider.
-- `-H`: pass this flag is you want to only measure the timing iteration of a benchmark. By default, the tracing tools will measure the entire execution.
-- `-p`: print the entire tracing script before execution. This is mainly for debugging use.
-- `-f`: change the bpftrace output format. By default, it uses human-readable plain text output (`text`). You can set this to `json` for easy parsing.
-
-Please run the tracing tools **before** running the workload. If you use `-H`, the tracing tools will automatically end with `harness_end` is called. Otherwise, you will need to terminate the tools manually with `Ctrl-C`. These tools also have a timeout of 1200 seconds so not to stall unattended benchmark execution. 
+- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace`
+executable in your `PATH`. We strongly recommend you use the latest statically
+complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases).
+You need to be able to have sudo permission for whichever `bpftrace` you want to use.
+- `-m`: the path to a MMTk binary that contains the tracepoints.
+This depends on the binding you use.
+For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under
+your build folder.
+To check whether the binary contains tracepoints, you can use `readelf -n`.
+You should see a bunch of `stapsdt` notes with `mmtk` as the provider.
+- `-H`: pass this flag is you want to only measure the timing iteration of a
+benchmark.
+By default, the tracing tools will measure the entire execution.
+- `-p`: print the entire tracing script before execution.
+This is mainly for debugging use.
+- `-f`: change the bpftrace output format.
+By default, it uses human-readable plain text output (`text`).
+You can set this to `json` for easy parsing.
+
+Please run the tracing tools **before** running the workload.
+If you use `-H`, the tracing tools will automatically end with `harness_end` is
+called.
+Otherwise, you will need to terminate the tools manually with `Ctrl-C`.
+These tools also have a timeout of 1200 seconds so not to stall unattended
+benchmark execution. 
 
 ## Tracing tools
 ### Measuring the time spend in allocation slow path (`alloc_slow`)
 This tool measures the distribution of the allocation slow path time.
-The time unit is ns / 400, so that we use the histogram bins with higher fidelity better.
+The time unit is ns / 400, so that we use the histogram bins with higher
+fidelity better.
 
 Sample output:
 ```
@@ -76,10 +103,13 @@ Sample output:
 [128K, 256K)           1 |                                                    |
 ```
 
-In the above output, we can see that most allocation slow paths finish between 3.2us and 6.4us. However, there is a long tail, presumably due to GC pauses.
+In the above output, we can see that most allocation slow paths finish between
+3.2us and 6.4us.
+However, there is a long tail, presumably due to GC pauses.
 
 ### Measuring the time spend in different GC stages (`gc_stages`)
-This tool measures the time spent in different stages of GC: before `Closure`, during `Closure`, and after `Closure`.
+This tool measures the time spent in different stages of GC: before `Closure`,
+during `Closure`, and after `Closure`.
 The time unit is ns.
 
 Sample output:
@@ -89,10 +119,15 @@ Sample output:
 @pre_closure_time: 103886118
 ```
 
-In the above output, overall, the execution spends 1.4s in the main transitive closure, 103ms before that, and 81ms after that (a total of around 1.5s).
+In the above output, overall, the execution spends 1.4s in the main transitive
+closure, 103ms before that, and 81ms after that (a total of around 1.5s).
 
 ### Measuring the time spend in lock contended state for Rust `Mutex` (`lock_contend`)
-This tools measures the time spent in the lock contended state for Rust `Mutex`s. The Rust standard library implements `Mutex` using the fast-slow-path paradigm. Most lock operations take place in inlined fast paths, when there's no contention. However, when there's contention, `std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
+This tools measures the time spent in the lock contended state for Rust `Mutex`s.
+The Rust standard library implements `Mutex` using the fast-slow-path paradigm.
+Most lock operations take place in inlined fast paths, when there's no contention.
+However, when there's contention,
+`std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
 
 ```rust
 #[inline]
@@ -109,7 +144,8 @@ fn lock_contended(&self) {
 ```
 
 
-MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization, and this tool can be useful to measure the contention in these parts of code.
+MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization,
+and this tool can be useful to measure the contention in these parts of code.
 
 The time unit is ns / 256.
 
@@ -130,12 +166,20 @@ Sample output:
 [2K, 4K)              15 |                                                    |
 ```
 
-In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) roughly has a bimodal distribution in terms of the time spent in lock contended code path. The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us.
-If you can't tell which lock instance is for which lock in MMTk, you can trace the allocation of the Mutex and record the stack trace (note that you might want to compile MMTk with `force-frame-pointers` to obtain better stack traces).
+In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90)
+roughly has a bimodal distribution in terms of the time spent in lock contended
+code path.
+The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us.
+
+If you can't tell which lock instance is for which lock in MMTk, you can trace
+the allocation of the Mutex and record the stack trace (note that you might want
+to compile MMTk with `force-frame-pointers` to obtain better stack traces).
 
 ### Measuring the distribution of `process_edges` packet sizes (`packet_size`)
-Most of the GC time is spend in the transitive closure for tracing-based GCs, and MMTk performs transitive closure via work packets that calls the `process_edges` method.
-This tool measures the distribution of the sizes of these work packets, and also count root edges separately.
+Most of the GC time is spend in the transitive closure for tracing-based GCs,
+and MMTk performs transitive closure via work packets that calls the `process_edges` method.
+This tool measures the distribution of the sizes of these work packets, and also
+count root edges separately.
 
 Sample output:
 ```
@@ -174,10 +218,14 @@ Sample output:
 [16K, 32K)             3 |                                                    |
 ```
 
-In the above output, we can see that overall, the sizes of the `process_edges` has a unimodal distribution with a peak around 16\~32 edges per packet. However, if we focus on root edges, the distribution is roughly bimodal, with a first peak around 8\~16 and a second peak around 4096\~8192.
+In the above output, we can see that overall, the sizes of the `process_edges`
+has a unimodal distribution with a peak around 16\~32 edges per packet.
+However, if we focus on root edges, the distribution is roughly bimodal, with a
+first peak around 8\~16 and a second peak around 4096\~8192.
 
 ## Attribution
-If used for research, please cite the following publication (the `BibTeX` record will be updated once a DOI is assigned).
+If used for research, please cite the following publication (the `BibTeX` record
+will be updated once a DOI is assigned).
 ```bibtex
 @inproceedings{conf/mplr/Huang23,
   author       = {Claire Huang and

From dc7eda41231789dd5441b98496bc0701dcd9d8ab Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 10:19:33 +0000
Subject: [PATCH 5/9] Fix time units

---
 tools/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/README.md b/tools/README.md
index ae6bd86b43..ab368fd01d 100644
--- a/tools/README.md
+++ b/tools/README.md
@@ -79,7 +79,7 @@ benchmark execution.
 ## Tracing tools
 ### Measuring the time spend in allocation slow path (`alloc_slow`)
 This tool measures the distribution of the allocation slow path time.
-The time unit is ns / 400, so that we use the histogram bins with higher
+The time unit is 400ns, so that we use the histogram bins with higher
 fidelity better.
 
 Sample output:
@@ -147,7 +147,7 @@ fn lock_contended(&self) {
 MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization,
 and this tool can be useful to measure the contention in these parts of code.
 
-The time unit is ns / 256.
+The time unit is 256ns.
 
 Sample output:
 ```

From 0783ee98df5523e5857f03c724f678c46e5cc874 Mon Sep 17 00:00:00 2001
From: Zixian Cai <u5937495@anu.edu.au>
Date: Fri, 11 Aug 2023 10:21:35 +0000
Subject: [PATCH 6/9] Move tracing tools into its own subfolder

---
 tools/{ => tracing}/README.md                            | 0
 tools/{ => tracing}/alloc_slow.bt                        | 0
 tools/{ => tracing}/epilogue.bt.fragment                 | 0
 tools/{ => tracing}/gc_stages.bt                         | 0
 tools/{ => tracing}/lock_contended.bt                    | 0
 tools/{ => tracing}/packet_size.bt                       | 0
 tools/{ => tracing}/prologue_with_harness.bt.fragment    | 0
 tools/{ => tracing}/prologue_without_harness.bt.fragment | 0
 tools/{ => tracing}/run.py                               | 0
 9 files changed, 0 insertions(+), 0 deletions(-)
 rename tools/{ => tracing}/README.md (100%)
 rename tools/{ => tracing}/alloc_slow.bt (100%)
 rename tools/{ => tracing}/epilogue.bt.fragment (100%)
 rename tools/{ => tracing}/gc_stages.bt (100%)
 rename tools/{ => tracing}/lock_contended.bt (100%)
 rename tools/{ => tracing}/packet_size.bt (100%)
 rename tools/{ => tracing}/prologue_with_harness.bt.fragment (100%)
 rename tools/{ => tracing}/prologue_without_harness.bt.fragment (100%)
 rename tools/{ => tracing}/run.py (100%)

diff --git a/tools/README.md b/tools/tracing/README.md
similarity index 100%
rename from tools/README.md
rename to tools/tracing/README.md
diff --git a/tools/alloc_slow.bt b/tools/tracing/alloc_slow.bt
similarity index 100%
rename from tools/alloc_slow.bt
rename to tools/tracing/alloc_slow.bt
diff --git a/tools/epilogue.bt.fragment b/tools/tracing/epilogue.bt.fragment
similarity index 100%
rename from tools/epilogue.bt.fragment
rename to tools/tracing/epilogue.bt.fragment
diff --git a/tools/gc_stages.bt b/tools/tracing/gc_stages.bt
similarity index 100%
rename from tools/gc_stages.bt
rename to tools/tracing/gc_stages.bt
diff --git a/tools/lock_contended.bt b/tools/tracing/lock_contended.bt
similarity index 100%
rename from tools/lock_contended.bt
rename to tools/tracing/lock_contended.bt
diff --git a/tools/packet_size.bt b/tools/tracing/packet_size.bt
similarity index 100%
rename from tools/packet_size.bt
rename to tools/tracing/packet_size.bt
diff --git a/tools/prologue_with_harness.bt.fragment b/tools/tracing/prologue_with_harness.bt.fragment
similarity index 100%
rename from tools/prologue_with_harness.bt.fragment
rename to tools/tracing/prologue_with_harness.bt.fragment
diff --git a/tools/prologue_without_harness.bt.fragment b/tools/tracing/prologue_without_harness.bt.fragment
similarity index 100%
rename from tools/prologue_without_harness.bt.fragment
rename to tools/tracing/prologue_without_harness.bt.fragment
diff --git a/tools/run.py b/tools/tracing/run.py
similarity index 100%
rename from tools/run.py
rename to tools/tracing/run.py

From 3eaef2aa9405c9149deebe50cc64576a3e5ce82f Mon Sep 17 00:00:00 2001
From: Zixian Cai <2891235+caizixian@users.noreply.github.com>
Date: Mon, 14 Aug 2023 16:43:00 +1000
Subject: [PATCH 7/9] Update tools/tracing/run.py

Co-authored-by: Kunshan Wang <wks1986@gmail.com>
---
 tools/tracing/run.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/tracing/run.py b/tools/tracing/run.py
index 9bdccf1632..c30224e9e2 100755
--- a/tools/tracing/run.py
+++ b/tools/tracing/run.py
@@ -35,7 +35,7 @@ def main():
         print(f"MMTk binary {str(mmtk_bin)} not found.")
         sys.exit(1)
     prologue_file = here / \
-        "prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment"
+        ("prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment")
     prologue = prologue_file.read_text()
     epilogue = (here / "epilogue.bt.fragment").read_text()
     template = Template(prologue + bpftrace_script.read_text() + epilogue)

From 5289fc4dd2c97e2a23bae8b5887e818d35a9febe Mon Sep 17 00:00:00 2001
From: Zixian Cai <2891235+caizixian@users.noreply.github.com>
Date: Mon, 14 Aug 2023 16:43:09 +1000
Subject: [PATCH 8/9] Update tools/tracing/lock_contended.bt

Co-authored-by: Kunshan Wang <wks1986@gmail.com>
---
 tools/tracing/lock_contended.bt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/tracing/lock_contended.bt b/tools/tracing/lock_contended.bt
index 044ed61b58..04f8bcaf4e 100644
--- a/tools/tracing/lock_contended.bt
+++ b/tools/tracing/lock_contended.bt
@@ -4,7 +4,7 @@ uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89
     }
 }
 
-uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
+uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* {
     if (@stats_enabled) {
         @lock_dist[@lock_nsecs[tid].0] = hist((nsecs - @lock_nsecs[tid].1)/256);
     }

From c5b368b1f8c50245485affd4e2c12a9c2362eb15 Mon Sep 17 00:00:00 2001
From: Zixian Cai <2891235+caizixian@users.noreply.github.com>
Date: Mon, 14 Aug 2023 16:43:15 +1000
Subject: [PATCH 9/9] Update tools/tracing/lock_contended.bt

Co-authored-by: Kunshan Wang <wks1986@gmail.com>
---
 tools/tracing/lock_contended.bt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/tracing/lock_contended.bt b/tools/tracing/lock_contended.bt
index 04f8bcaf4e..4afb7984a2 100644
--- a/tools/tracing/lock_contended.bt
+++ b/tools/tracing/lock_contended.bt
@@ -1,4 +1,4 @@
-uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
+uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* {
     if (@stats_enabled) {
         @lock_nsecs[tid] = (arg0, nsecs);
     }