parallel-ml-bench

Parallel ML benchmark suite for the mpl compiler.

MaPLe (MPL) is an extension of the MLton compiler for Standard ML which implements support for nested (fork-join) parallelism. MPL generates executables with excellent multicore performance, utilizing a novel approach to memory management based on the theory of disentanglement [1,2,3,4]. MPL is research software and is being actively developed.
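
To give a flavor of the programming model, here is a minimal fork-join sketch in Standard ML using MPL's ForkJoin.par primitive, which evaluates two functions (possibly in parallel) and returns both results. This example is illustrative only and is not taken from the benchmark suite; the sequential cutoff of 20 is an arbitrary choice for granularity control.

(* Illustrative only: parallel Fibonacci with ForkJoin.par.
 * ForkJoin.par : (unit -> 'a) * (unit -> 'b) -> 'a * 'b
 * Below a (hypothetical) cutoff, fall back to sequential code
 * to avoid spawning very small tasks. *)
fun fib n =
  if n <= 20 then seqFib n
  else
    let
      val (a, b) = ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
    in
      a + b
    end
and seqFib n =
  if n < 2 then n else seqFib (n - 1) + seqFib (n - 2)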

This suite consists of sophisticated parallel benchmarks from a variety of problem domains, including graphs, text processing, digital audio processing, image analysis and manipulation, numerical algorithms, computational geometry, and others. These are ported to MPL from existing state-of-the-art parallel C/C++ benchmark suites and libraries, including PBBS, ParlayLib, Ligra, and PAM. Even though these benchmarks were originally written in C/C++, all of them are naturally disentangled (roughly: no task ever accesses memory allocated by another concurrently executing task).

All benchmarks are "problem-based", i.e., the input and output are well-specified, allowing us to do cross-language comparisons. This repo includes comparisons with Java, Go, Multicore OCaml, and C++. We also compare against MLton (on which MPL is based) to determine overall overheads and scalability.

Please see also mpllib, which is a library of key data structures and algorithms used throughout the MPL benchmarks.

Sample of results

Java comparison

Comparison with Java on a few benchmarks, with runtimes $T_p$ (in seconds) and space usage (maximum residency, in GB) $R_p$ on $p$ processors. Column $J/M$ is the overhead of Java relative to MPL; higher ratios are better for MPL.

[Figure: Comparison between Java and MPL]

Go comparison

Comparison with Go on a few benchmarks, with runtimes $T_p$ (in seconds) and space usage $R_p$ (in GB) on $p$ processors. Column $G/M$ is the overhead of Go relative to MPL; higher ratios are better for MPL.

[Figure: Comparison between Go and MPL]

Speedups

Speedup of MPL over MLton on up to 72 processors across 30 benchmarks, measured on a 72-core server (4x 2.4GHz Intel Xeon E7-8867 v4) with 1TB of memory.

[Figure: Speedups of MPL over MLton on up to 72 processors]

Requirements

Machine: Linux x86_64 multicore

Software requirements:

  • GCC version 9 (or later)
  • MLton version 20210117
  • mpl-switch
  • smlpkg
  • jq
  • numactl (for C++, Java, Go experiments)
  • OCaml experiments only:
  • Java experiments only:
    • Java version 11
  • Go experiments only:
    • Go version 1.18

Setup

First clone the repository:

$ git clone https://github.com/MPLLang/parallel-ml-bench
$ cd parallel-ml-bench

Then, run the init script. This should take approximately 15 minutes. It installs all necessary versions of mpl and generates inputs. Note that the generated inputs require approximately 5GB of disk space.

$ ./init

Run experiments

After completing the setup (described above), you can run all of the experiments by passing a comma-separated list of processor counts to the run script.

For example, here we run on 1, 4, and 8 processors:

$ ./run --procs 1,4,8

This command begins by compiling all of the benchmarks and then runs them one by one. This will take a long time; depending on how many processor counts you pass, it could take multiple hours.

A progress indicator is printed at the beginning of each run showing how far along the experiment is. For example, [5/100] means this is the 5th of 100 total runs.

Results are written to a timestamped file in the results/ directory. The timestamp format is YYMMDD-HHMMSS. Each line of a results file is a JSON object with the parameters and results from that run.
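
Since each line is a standalone JSON object, you can inspect results with jq (listed above under requirements). For example, to pretty-print every record, or to filter records by a field (the field name "procs" below is only a guess at how the processor count might be recorded; check a results file for the actual keys):

$ jq . results/YYMMDD-HHMMSS
$ jq -c 'select(.procs == 8)' results/YYMMDD-HHMMSS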

You may terminate the run script early with Ctrl-C. All results obtained so far will still be available in the results file.

Generate report

After using the run script as described above, you can run the following script to generate tables, speedup plots, etc. This prints summary information to the terminal and also generates data in the figures/ directory.

$ ./report

TODO

Infrastructure and benchmarks

  • update inputs (need words256, input graphs, etc.)
  • report script (speedup plots, summary table, etc.)
  • benchmarks documentation
  • MPL: infrastructure
  • MPL: benchmarks
  • C++: infrastructure
  • C++: more benchmarks
  • Java: infrastructure
  • Java: more benchmarks
  • OCaml: infrastructure
  • OCaml: more benchmarks
  • Go: infrastructure
  • Go: more benchmarks

Performance optimization

  • Try z-sort for nearest neighbors.
  • Try leaf-based nearest neighbors. Would this also improve performance for the neighbors component of Delaunay?
  • Compare direction-optimizing BFS with C++ Ligra.
  • Compare delayed-seq performance against hand-optimized versions for the delayed-seq benchmarks.

References

[1] Hierarchical Memory Management for Parallel Programs. Ram Raghunathan, Stefan K. Muller, Umut A. Acar, and Guy Blelloch. ICFP 2016.

[2] Hierarchical Memory Management for Mutable State. Adrien Guatto, Sam Westrick, Ram Raghunathan, Umut Acar, and Matthew Fluet. PPoPP 2018.

[3] Disentanglement in Nested-Parallel Programs. Sam Westrick, Rohan Yadav, Matthew Fluet, and Umut A. Acar. POPL 2020.

[4] Provably Space-Efficient Parallel Functional Programming. Jatin Arora, Sam Westrick, and Umut A. Acar. POPL 2021.