Parallel ML benchmark suite for the mpl compiler.
MaPLe (MPL) is an extension of the MLton compiler for Standard ML which implements support for nested (fork-join) parallelism. MPL generates executables with excellent multicore performance, utilizing a novel approach to memory management based on the theory of disentanglement [1,2,3,4]. MPL is research software and is being actively developed.
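To give a feel for the programming model, here is a minimal fork-join sketch in Standard ML. It assumes MPL's ForkJoin.par primitive (which evaluates two thunks, possibly in parallel, and returns both results); the function itself is illustrative and is not taken from the benchmark suite.

    (* Fork-join in MPL: ForkJoin.par takes two thunks, runs them
       (potentially in parallel), and returns both results. The sequential
       cutoff keeps tasks coarse enough to amortize scheduling overhead.
       Illustrative sketch only; not code from the suite. *)
    fun fib (n: int) : int =
      if n < 20 then
        (* plain sequential recursion below the cutoff *)
        (if n < 2 then n else fib (n - 1) + fib (n - 2))
      else
        let
          val (a, b) = ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
        in
          a + b
        end

The sequential cutoff (here an arbitrary 20) is the usual way to control task granularity in this style of code.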
This suite consists of sophisticated parallel benchmarks from various problem domains, including graphs, text processing, digital audio processing, image analysis and manipulation, numerical algorithms, computational geometry, and others. These are ported to MPL from existing state-of-the-art parallel C/C++ benchmark suites and libraries including PBBS, ParlayLib, Ligra, and PAM. Even though these benchmarks were originally written in C/C++, all of them are naturally disentangled.
All benchmarks are "problem-based", i.e., the input and output are well-specified, allowing us to do cross-language comparisons. This repo includes comparisons with Java, Go, Multicore OCaml, and C++. We also compare against MLton (on which MPL is based) to determine overall overheads and scalability.
Please see also mpllib, which is a library of key data structures and algorithms used throughout the MPL benchmarks.
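As a rough illustration of the style of code found in the library and benchmarks, below is a sketch of a divide-and-conquer parallel array reduction written directly against ForkJoin.par. It is not taken from mpllib, and the grain size is an arbitrary placeholder.

    (* Illustrative sketch (not from mpllib): parallel sum of an int array.
       Below a grain size the loop is sequential; above it, the range is
       split in half and the halves are combined with ForkJoin.par. *)
    fun sumRange (a: int array, lo: int, hi: int) : int =
      if hi - lo <= 1000 then
        let
          fun loop (i, acc) =
            if i >= hi then acc else loop (i + 1, acc + Array.sub (a, i))
        in
          loop (lo, 0)
        end
      else
        let
          val mid = lo + (hi - lo) div 2
          val (left, right) =
            ForkJoin.par (fn _ => sumRange (a, lo, mid),
                          fn _ => sumRange (a, mid, hi))
        in
          left + right
        end

    fun sum a = sumRange (a, 0, Array.length a)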
Figure: comparison with Java on a few benchmarks, with runtimes.
Figure: comparison with Go on a few benchmarks, with runtimes.
Figure: speedup of MPL over MLton on up to 72 processors across 30 benchmarks, measured on a 72-core server (4x 2.4GHz Intel Xeon E7-8867 v4) with 1TB of memory.
Machine: Linux x86_64 multicore
Software requirements:
- GCC version 9 (or later)
- MLton version 20210117
- mpl-switch
- smlpkg
- jq
- numactl (for C++, Java, Go experiments)
- OCaml experiments only:
  - OCaml version 5.0.0+trunk
  - opam version 2.1.0
  - domainslib version 0.4.2
  - dune
- Java experiments only:
  - Java version 11
- Go experiments only:
  - Go version 1.18
First clone the repository:
$ git clone https://github.com/MPLLang/parallel-ml-bench
$ cd parallel-ml-bench
Then, run the init script. This should take approximately 15 minutes. It
installs all necessary versions of mpl
and generates inputs. Note that
the generated inputs require approximately 5GB of disk space.
$ ./init
After completing the setup (described above), you can run all of the experiments by passing a comma-separated list of processor counts to the run script.
For example, here we run on 1, 4, and 8 processors:
$ ./run --procs 1,4,8
This command begins by compiling all of the benchmarks and then runs them one by one. This will take a long time; depending on how many processors you use, it could take multiple hours.
A progress indicator, printed at the beginning of each run, shows how many runs are left. For example, [5/100] means that this is the 5th benchmark out of 100 total benchmarks to run.
Results are written to a timestamped file in the results/
directory. The
timestamp format is YYMMDD-HHMMSS
. Each line of a results file is a
JSON object with the parameters and results from that run.
You may terminate the run
script early with Ctrl-C
. All results obtained
so far will still be available in the results file.
After using the run
script as described above, you can run the following
script to generate tables, speedup plots, etc. This will print out info at
the command line, and also generate data in the figures/
directory.
$ ./report
TODO: infrastructure and benchmarks
- update inputs (need words256, input graphs, etc.)
- report script (speedup plots, summary table, etc.)
- benchmarks documentation
- MPL: infrastructure
- MPL: benchmarks
- C++: infrastructure
- C++: more benchmarks
- Java: infrastructure
- Java: more benchmarks
- OCaml: infrastructure
- OCaml: more benchmarks
- Go: infrastructure
- Go: more benchmarks
TODO: performance optimization
- Try z-sort for nearest neighbors.
- Try leaf-based nearest neighbors. Would this improve performance for neighbors component of Delaunay, too?
- Compare direction-optimizing BFS with C++ Ligra
- Check delayed-seq performance vs hand-optimized for delayed-seq benchmarks
[1] Hierarchical Memory Management for Parallel Programs. Ram Raghunathan, Stefan K. Muller, Umut A. Acar, and Guy Blelloch. ICFP 2016.
[2] Hierarchical Memory Management for Mutable State. Adrien Guatto, Sam Westrick, Ram Raghunathan, Umut Acar, and Matthew Fluet. PPoPP 2018.
[3] Disentanglement in Nested-Parallel Programs. Sam Westrick, Rohan Yadav, Matthew Fluet, and Umut A. Acar. POPL 2020.
[4] Provably Space-Efficient Parallel Functional Programming. Jatin Arora, Sam Westrick, and Umut A. Acar. POPL 2021.