Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tracing tools and documentation #898

Merged
merged 12 commits into from
Aug 14, 2023
193 changes: 193 additions & 0 deletions tools/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
# MMTk performace tracing

## Notes for MMTk developers
Please open pull requests if you develop new tools that others might find useful. When you add new tools, please update this documentation.
If you change MMTk internals that the tracing tools depend on (such as the definition of `enum WorkBucketStage`), please update the scripts accordingly.

## Notes for MMTk users
Since some of the tools depend on the MMTk internals, please use the tools shipped with the MMTk release you use.

## Tracepoints
Currently, the core provides the following tracepoints.
- `mmtk:collection_initialized()`: GC is enabled
- `mmtk:harness_begin()`: the timing iteration of a benchmark begins
- `mmtk:harness_end()`: the timing iteration of a benchmark ends
- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop
- `mmtk:gcworker_run()`: a GC worker thread enters its work loop
- `mmtk:gc_start()`: a collection epoch starts
- `mmtk:gc_end()`: a collection epoch ends
- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the `process_edges` method. The first argument is the number of edges to be processed, and the second argument is whether these edges are root edges.
- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the numerical representation of `enum WorkBucketStage`.
- `mmtk:work_poll()`: a work packet is to be polled.
- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just executed. The first argument is points to the string of the Rust type name of the work packet, and the second argument is the length of the string.
- `mmtk:alloc_slow_once_start()`: the allocation slow path starts.
- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.

## Running tracing tools
The tracing tools are to be invoked by a wrapper script `run.py`.
```
usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool

positional arguments:
tool Name of the bpftrace tool

optional arguments:
-h, --help show this help message and exit
-b BPFTRACE, --bpftrace BPFTRACE
Path of the bpftrace executable
-m MMTK, --mmtk MMTK Path of the MMTk binary
-H, --harness Only collect data for the timing iteration (harness_begin/harness_end)
-p, --print-script Print the content of the bpftrace script
-f {text,json}, --format {text,json}
bpftrace output format
```

- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace` executable in your `PATH`. We strongly recommend you use the latest statically complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). You need to be able to have sudo permission for whichever `bpftrace` you want to use.
- `-m`: the path to a MMTk binary that contains the tracepoints. This depends on the binding you use. For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under your build folder. To check whether the binary contains tracepoints, you can use `readelf -n`. You should see a bunch of `stapsdt` notes with `mmtk` as the provider.
- `-H`: pass this flag is you want to only measure the timing iteration of a benchmark. By default, the tracing tools will measure the entire execution.
- `-p`: print the entire tracing script before execution. This is mainly for debugging use.
- `-f`: change the bpftrace output format. By default, it uses human-readable plain text output (`text`). You can set this to `json` for easy parsing.

Please run the tracing tools **before** running the workload. If you use `-H`, the tracing tools will automatically end with `harness_end` is called. Otherwise, you will need to terminate the tools manually with `Ctrl-C`. These tools also have a timeout of 1200 seconds so not to stall unattended benchmark execution.

## Tracing tools
### Measuring the time spend in allocation slow path (`alloc_slow`)
This tool measures the distribution of the allocation slow path time.
The time unit is ns / 400, so that we use the histogram bins with higher fidelity better.
wks marked this conversation as resolved.
Show resolved Hide resolved

Sample output:
```
@alloc_slow_hist:
[4, 8) 304 |@ |
[8, 16) 12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 941 |@@@ |
[64, 128) 171 | |
[128, 256) 13 | |
[256, 512) 2 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 14 | |
[32K, 64K) 37 | |
[64K, 128K) 19 | |
[128K, 256K) 1 | |
```

In the above output, we can see that most allocation slow paths finish between 3.2us and 6.4us. However, there is a long tail, presumably due to GC pauses.

### Measuring the time spend in different GC stages (`gc_stages`)
This tool measures the time spent in different stages of GC: before `Closure`, during `Closure`, and after `Closure`.
The time unit is ns.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general style problem: I find that you never uses newline characters within one paragraph. Markdown allows this.

But here the text spans over three lines. I think you intend to introduce one title and two paragraphs. However, Markdown will interpret consecutive non-blank lines as one paragraph (except for titles (### blah blah blah)). If you intend to put "The time unit is ns." in a single paragraph, you need to add an empty line before it.

I also recommend formatting the text so that the line width is no more than 100 characters long (or 80 or 72 if you prefer). The fmt command line utility may come in handy. If you are a Vim user, you can set tw=100, select one or more paragraphs and press gq to format them. In this way, this text file will still be easy to read using text editors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Sample output:
```
@closure_time: 1405302743
@post_closure_time: 81432919
@pre_closure_time: 103886118
```

In the above output, overall, the execution spends 1.4s in the main transitive closure, 103ms before that, and 81ms after that (a total of around 1.5s).

### Measuring the time spend in lock contended state for Rust `Mutex` (`lock_contend`)
This tools measures the time spent in the lock contended state for Rust `Mutex`s. The Rust standard library implements `Mutex` using the fast-slow-path paradigm. Most lock operations take place in inlined fast paths, when there's no contention. However, when there's contention, `std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.

```rust
#[inline]
pub fn lock(&self) {
if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() {
self.lock_contended();
}
}

#[cold]
fn lock_contended(&self) {
<snip>
}
```


MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization, and this tool can be useful to measure the contention in these parts of code.

The time unit is ns / 256.
wks marked this conversation as resolved.
Show resolved Hide resolved

Sample output:
```
@lock_dist[140637228007056]:
[1] 447 |@@@@ |
[2, 4) 3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 8) 3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 1354 |@@@@@@@@@@@@@@ |
[16, 32) 832 |@@@@@@@@ |
[32, 64) 1077 |@@@@@@@@@@@ |
[64, 128) 2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1203 |@@@@@@@@@@@@ |
[1K, 2K) 34 | |
[2K, 4K) 15 | |
```

In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) roughly has a bimodal distribution in terms of the time spent in lock contended code path. The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us.
If you can't tell which lock instance is for which lock in MMTk, you can trace the allocation of the Mutex and record the stack trace (note that you might want to compile MMTk with `force-frame-pointers` to obtain better stack traces).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are two paragraphs, there needs to be an empty line in between.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


### Measuring the distribution of `process_edges` packet sizes (`packet_size`)
Most of the GC time is spend in the transitive closure for tracing-based GCs, and MMTk performs transitive closure via work packets that calls the `process_edges` method.
This tool measures the distribution of the sizes of these work packets, and also count root edges separately.

Sample output:
```
@process_edges_packet_size:
[1] 238 |@@@@@ |
[2, 4) 806 |@@@@@@@@@@@@@@@@@ |
[4, 8) 1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 1105 |@@@@@@@@@@@@@@@@@@@@@@@ |
[16, 32) 2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 1131 |@@@@@@@@@@@@@@@@@@@@@@@@ |
[256, 512) 2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512, 1K) 1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 1028 |@@@@@@@@@@@@@@@@@@@@@@ |
[2K, 4K) 874 |@@@@@@@@@@@@@@@@@@ |
[4K, 8K) 1024 |@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 58 |@ |
[16K, 32K) 5 | |

@process_edges_root_packet_size:
[1] 71 |@@@@@@@ |
[2, 4) 4 | |
[4, 8) 276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 242 |@@@@@@@@@@@@@@@@@@@@@@@@@ |
[128, 256) 109 |@@@@@@@@@@@ |
[256, 512) 31 |@@@ |
[512, 1K) 33 |@@@ |
[1K, 2K) 75 |@@@@@@@ |
[2K, 4K) 75 |@@@@@@@ |
[4K, 8K) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 56 |@@@@@ |
[16K, 32K) 3 | |
```

In the above output, we can see that overall, the sizes of the `process_edges` has a unimodal distribution with a peak around 16\~32 edges per packet. However, if we focus on root edges, the distribution is roughly bimodal, with a first peak around 8\~16 and a second peak around 4096\~8192.

## Attribution
If used for research, please cite the following publication (the `BibTeX` record will be updated once a DOI is assigned).
```bibtex
@inproceedings{conf/mplr/Huang23,
author = {Claire Huang and
Stephen M. Blackburn and
Zixian Cai},
title = {Improving Garbage Collection Observability with Performance Tracing},
booktitle = {Proceedings of the 20th International Conference on Managed Programming
Languages and Runtimes, {MPLR} 2023, Cascais, Portugal, October
22, 2023},
publisher = {{ACM}},
year = {2023}
}
```
11 changes: 11 additions & 0 deletions tools/alloc_slow.bt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
usdt:$MMTK:mmtk:alloc_slow_once_start {
if (@stats_enabled) {
@alloc_slow_nsecs[tid] = nsecs;
}
}

usdt:$MMTK:mmtk:alloc_slow_once_end {
if (@stats_enabled) {
@alloc_slow_hist = hist((nsecs - @alloc_slow_nsecs[tid])/400);
}
}
7 changes: 7 additions & 0 deletions tools/epilogue.bt.fragment
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
END {
system("rm $TMP_FILE");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to delete the temp file if we use exec. See the comment on run.py.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above


interval:s:1200 {
exit();
}
28 changes: 28 additions & 0 deletions tools/gc_stages.bt
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
usdt:$MMTK:mmtk:gc_start {
if (@stats_enabled) {
@gc_start_nsecs = nsecs;
}
}

usdt:$MMTK:mmtk:gc_end {
if (@stats_enabled) {
@post_closure_time += nsecs - @post_closure_nsecs;
}
}

usdt:$MMTK:mmtk:bucket_opened {
if (@stats_enabled) {
$ns = nsecs;
// Please check enum WorkBucketStage for the numerical values of stages
// Closure is 2 when vo_bit is not set
if (arg0 == 2) {
@closure_nsecs = $ns;
@pre_closure_time += $ns - @gc_start_nsecs;
}
// Release is 14 when vo_bit is not set
if (arg0 == 14) {
@post_closure_nsecs = $ns;
@closure_time += $ns - @closure_nsecs;
}
}
}
11 changes: 11 additions & 0 deletions tools/lock_contended.bt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
if (@stats_enabled) {
@lock_nsecs[tid] = (arg0, nsecs);
}
}

uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended17he467d89501d7157aE {
if (@stats_enabled) {
@lock_dist[@lock_nsecs[tid].0] = hist((nsecs - @lock_nsecs[tid].1)/256);
}
}
8 changes: 8 additions & 0 deletions tools/packet_size.bt
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
usdt:$MMTK:mmtk:process_edges {
if (@stats_enabled) {
@process_edges_packet_size = hist(arg0);
if (arg1) {
@process_edges_root_packet_size = hist(arg0);
}
}
}
10 changes: 10 additions & 0 deletions tools/prologue_with_harness.bt.fragment
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
usdt:$MMTK:mmtk:harness_begin {
//begin collecting data at harness_begin (start of final iteration)
@stats_enabled = 1;
}

usdt:$MMTK:mmtk:harness_end {
//end data at harness_end (end of final iteration)
@stats_enabled = 0;
exit();
}
4 changes: 4 additions & 0 deletions tools/prologue_without_harness.bt.fragment
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
BEGIN {
//always collect data
@stats_enabled = 1;
}
57 changes: 57 additions & 0 deletions tools/run.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
#!/usr/bin/env python3
from string import Template
from argparse import ArgumentParser
from pathlib import Path
import tempfile
import sys
import os


def get_args():
parser = ArgumentParser()
parser.add_argument("-b", "--bpftrace", type=str, default="bpftrace",
help="Path of the bpftrace executable")
parser.add_argument("-m", "--mmtk", type=str, required=True,
help="Path of the MMTk binary")
parser.add_argument("-H", "--harness", action="store_true",
help="Only collect data for the timing iteration (harness_begin/harness_end)")
parser.add_argument("-p", "--print-script", action="store_true",
help="Print the content of the bpftrace script")
parser.add_argument(
"-f", "--format", choices=["text", "json"], default="text", help="bpftrace output format")
parser.add_argument("tool", type=str, help="Name of the bpftrace tool")
return parser.parse_args()


def main():
args = get_args()
here = Path(__file__).parent.resolve()
bpftrace_script = here / f"{args.tool}.bt"
if not bpftrace_script.exists():
print(f"Tracing script {str(bpftrace_script)} not found.")
sys.exit(1)
mmtk_bin = Path(args.mmtk)
if not mmtk_bin.exists():
print(f"MMTk binary {str(mmtk_bin)} not found.")
sys.exit(1)
prologue_file = here / \
"prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment"
with prologue_file.open() as fd:
prologue = fd.read()
wks marked this conversation as resolved.
Show resolved Hide resolved
with (here / "epilogue.bt.fragment").open() as fd:
epilogue = fd.read()
with bpftrace_script.open() as fd:
template = Template(prologue + fd.read() + epilogue)
with tempfile.NamedTemporaryFile(mode="w+t") as tmp:
wks marked this conversation as resolved.
Show resolved Hide resolved
content = template.safe_substitute(
MMTK=mmtk_bin, TMP_FILE=tmp.name)
if args.print_script:
print(content)
tmp.write(content)
tmp.flush()
os.execvp("sudo", ["sudo", args.bpftrace,
"--unsafe", "-f", args.format, tmp.name])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python will automatically delete the NamedTemporaryFile when exiting the scope of with. However, execvp never returns. You can replace this with subprocess.run or subprocess.Popen.

An alternative to using temp file is using pipes. I tried piping the script into sudo bpftrace and it works. For example, I am currently using erb (from Ruby) for templating, and my command line looks like this:

erb record_every=50 so_path=/home/wks/projects/mmtk-github/openjdk/build/linux-x86_64-normal-server-release/jdk/lib/server/libmmtk_openjdk.so  gcvis.erb | sudo bpftrace - > openjdk-packet-size3-nopoll.log

In Python, the subprocess.run function (or popen.communicate) allows you to specify the stdin as a pipe, so you can directly pipe the Python string prologue + content + epilogue into sudo bpftrace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a deliberate choice. I will add comments to explain this.



if __name__ == "__main__":
main()