Skip to content

Benchmarks of the Juvix C runtime

Łukasz Czajka edited this page Nov 6, 2024 · 1 revision

Benchmarks of the Juvix C runtime

Benchmarked version: commit 148ececb4d4259eacbb980f5992073a3ac611d82 from 31.10.2022

Summary

We benchmark several programs manually compiled into the primitives of the new Juvix C runtime. The code corresponds closely to the code that will be generated by the new compilation process, with basic low-level optimisations (unboxing, untagging, etc.) but without any high-level optimisations on JuvixCore (inlining, specialisation, constant folding, fusion, etc.). This corresponds to the compilation process planned for the 0.4 milestone.

We compare the running time and memory usage with analogous programs written in Haskell, OCaml, JuvixCore (using the evaluator), old Juvix (with the "direct" transpilation to C) and C.

The results suggest that for most first-order programs the new compilation process will produce code with running time comparable to the code produced by the native OCaml compiler. For higher-order programs heavy on closure manipulation, the results are acceptable but noticeably worse, especially with third-order functions (i.e. functions which take functions taking functions). This could, however, be alleviated by implementing the specialisation optimisation (see the "specialised" column in the ackermann and mapfun benchmarks). Besides, functional programs of order higher than two are rare.

The comparisons with OCaml and Haskell were not entirely fair because the new Juvix runtime does not perform garbage collection. The overhead of garbage collection is particularly visible on the mergesort benchmark which creates many intermediate data structures that are quickly discarded. With proper memory management, the running time results on first-order programs for the new Juvix runtime are expected to become slightly worse than for the native OCaml compiler.

For simple programs operating on integers which don't require any heap memory allocation (fibonacci and combinations benchmarks), the direct transpilation to C in the old Juvix seems to perform best (behind only C). The reason is that for very simple programs `clang` can better optimise the output of such a direct transpiler. The main problem with the transpilation to C approach is that it cannot scale to reliably work for more complex programs, as evidenced by the segfaults, longer running time and higher memory use on other benchmarks.

In addition to the fibonacci and combinations benchmarks, the advantage of direct transpilation for very simple programs is also visible on the fold benchmark where a simple loop over a list dominates the running time. However, this is partly because the translation of closures in the old Juvix transpilation method is incorrect allowing it to be more efficient.

Benchmark programs

fibonacci

Compute the Nth Fibonacci number modulo 228 (N = 100’000’000)

The Nth Fibonacci number is computed in O(N). Needs only constant stack space and no heap memory. This benchmark tests the efficiency of tail recursion and arithmetic operations.

combinations

Count combinations of numbers 1 to N having sum N (N = 100)

This benchmark tests the efficiency of general recursion. No heap memory needs to be allocated. Uses stack space proportional to N. The running time is exponential in N.

prime

Compute the Nth prime (N = 16384)

The Nth prime number is computed via the Eratosthenes sieve. A list of N primes is created. No intermediate lists are discarded (garbage collection not needed). This benchmark tests the efficiency of tail recursion, arithmetic operations, list cell allocation and access.

mergesort

Merge sort a list of N integers (N = 2’000’000)

At each level of merge sort intermediate lists are created and discarded. The running time for this benchmark largely depends on the efficiency of memory management. Here one may observe the overhead of garbage collection or the memory blow-up if no garbage collection is used.

maybe

Optionally sum N integers from a binary tree K times (N = 220, K = 100)

If a fixed number k is encountered in the tree then the result is Nothing, otherwise it is Just sum. The computation is repeated for values of k from 0 to K. This tests the efficiency of handling optional values and data structure access.

fold

Fold a list of N integers K times (N = 100’000, K = 1000)

The sum of N natural numbers is computed via foldleft (tail-recursive). The computation is repeated K times. The list is created only once, so that allocation time does not dominate. This benchmark tests the efficiency of closure call and list cell access.

cps

Compute the Nth Fibonacci number modulo 228 with CPS (N = 100’000’000)

The function computing the Nth Fibonacci number is written in continuation-passing style, tail-recursively calling a continuation supplied as an argument. This benchmark tests the efficiency of closure call and allocation.

mapfold

Map and fold a list of N integers K times (N = 10000, K = 10000)

This benchmark tests the efficiency of standard higher-order functions on lists, closure call and memory management. The program allocates O(K) intermediate lists of length N which are quickly discarded.

ackermann

Compute Ack(3, N) with the higher-order Ackermann function definition (N = 11)

The higher-order Ackermann function definition iterates an iteration of function compositions. Hence, it uses a third-order invocation of an iteration function. This benchmark tests the efficiency of creating and calling second-order closures, and of partial application.

mapfun

Successively map K functions to a list of N integers (K = 100, N = 10000)

The benchmark stores K second-order closures in a list, maps them successively to a list of K closures, and then successively maps the K closures from the result to a list of N integers. This benchmark tests the efficiency of manipulating closures and storing them in data structures.

The benchmark programs can be found in tests/benchmark in the Juvix source directory.

Methodology

For each program the total running time (elapsed real time) and memory use (maximum resident set size) were measured on an M1 iMac with no significant background activity. Averages of several runs were taken. The variance was negligible, unless indicated otherwise by providing a range.

Results

fibonacci

Compute the Nth Fibonacci number modulo 228 (N = 100’000’000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.26 0.35 0.35 0.23 13.15 10.03 0.39 0.35 0.94 0.16 0.22
Memory use (MB, max RSS) 1.5 3.8 1.3 8.8 21.3 8067.7 9.7 1.7 1.8 1.3 4.0

combinations

Count all combinations of numbers 1 to N having sum N (N = 1000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 6.67 11.25 3.22 5.1 441.71 5.48 5.48 6.53 41.08 2.69 4.80
Memory use (MB, max RSS) 1.5 3.9 1.3 8.9 22.3 9.6 9.6 1.7 1.9 1.3 4.0

prime

Compute the Nth prime (N = 16384)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 1.52 1.91 segfault 3.09 167.04 3.85 3.85 1.68 14.82 0.12 0.13
Memory use (MB, max RSS) 1.7 4.0 segfault 9.3 24.4 9.8 9.6 2.2 2.2 1.4 4.0

mergesort

Merge sort a list of N integers (N = 2’000’000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.40 0.31 3.55 1.32 22.45 2.86 2.90 1.95 3.52 0.15 0.15
Memory use (MB, max RSS) 1973.7 720.4 5046.7 2729.8 1728.9 253.6 253.6 172.6 343.1 24.4 26.8

maybe

Optionally sum N non-zero integers from a binary tree K times (N = 220, K = 100)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.45 0.64 3.29 1.57 22.75 5.58 0.59 0.30 3.57 0.27 0.50
Memory use (MB, max RSS) 1.6 3.8 2646.1 1320.9 22.4 5560.7 9.7 3.9 4.0 1.3 4.1

fold

Fold a list of N integers K times (N = 100’000, K = 1000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.45 0.54 0.35 0.23 15.27 0.58 0.58 0.36 1.80 NA NA
Memory use (MB, max RSS) 3.1 4.6 4.4 10.6 43.4 12.7 12.7 5.9 5.9 NA NA

cps

Compute the Nth Fibonacci number modulo 228 with CPS (N = 100’000’000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.43 0.52 1.56 stack overflow 20.22 10.04 0.39 0.35 1.60 0.16 0.25
Memory use (MB, max RSS) 1.5 3.9 1539.3 stack overflow 21.3 8067.7 9.7 1.7 1.8 1.3 4.0

mapfold

Map and fold a list of N integers K times (N = 10000, K = 10000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 1.01 1.59 2.74 1.81 38.24 1.29 2.42 1.43 4.22 NA NA
Memory use (MB, max RSS) 2154.5 893.0 3059.1 1542.0 26.4 10.6 10.7 7.5 10-20 NA NA

ackermann

Compute Ack(3, N) with the higher-order Ackermann function definition (N = 11)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix C runtime (specialised, native) Juvix C runtime (specialised, wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 0.92 1.21 0.30 0.65 segfault runtime error 11.71 0.87 0.47 0.54 1.35 0.00 0.14
Memory use (MB, max RSS) 2.6 4.1 2.3 3.9 segfault runtime error 23.3 13.6 9.6 2.0 3.6 1.3 4.0

mapfun

Successively map K functions to a list of N integers (K = 100, N = 10000)

Juvix C runtime (native) Juvix C runtime (wasm32, wasmer) Juvix C runtime (specialised, native) Juvix C runtime (specialised, wasm32, wasmer) Juvix transpilation (native) Juvix transpilation (wasm32, wasmer) JuvixCore evaluator Haskell (native, ghc -O2) Haskell (native, ghc -XStrict -O2) OCaml (native, ocamlopt -O2) OCaml (bytecode) C (native, clang -O3) C (wasm32, clang -Os, wasmer)
Time (seconds, real) 1.27 1.04 0.39 0.46 segfault runtime error 4.18 1.85 0.95 0.19 0.68 NA NA
Memory use (MB, max RSS) 3209.8 1229.7 21.8 13.2 segfault runtime error 33.0 13.6 11.6 5.3 7.9 NA NA

Comments

"Juvix C runtime" denotes C programs written using the primitives of the new Juvix runtime. These programs were "manually" compiled from the corresponding Juvix/JuvixCore programs, according to the new Juvix compilation concept. They correspond closely to the code that will be generated by the basic version of the new compilation process, without any high-level optimisations (inlining, specialisation, fusion, constant folding) but with basic low-level memory representation optimisations (unboxing, untagging, etc). This version of the new compilation process should be finished with the 0.4 milestone.

The "specialised" column for "Juvix C runtime" denotes a version of the corresponding "Juvix C runtime" benchmark program for which specialisation of higher-order functions was manually performed (to simulate the effects of the high-level specialisation optimisation).

"Juvix transpilation" denotes Juvix programs compiled with the old compilation process via a "direct" translation to C. For a fair comparison, all number operations were implemented using native binary C integers (exposed via foreign and compile blocks) without overflow check, instead of using the unary Nat from the standard library. For Haskell, we use the fixed-precision Int instead of the arbitrary-precision Integer.

For the simplest benchmark programs without heap memory allocation (e.g. fibonacci, combinations), the performance of "Juvix transpilation" is comparable to or better than that of "Juvix C runtime". This is because clang managed to eliminate (tail) recursion and optimise the code to essentially the same or better thing. The main problem with the old "direct" transpilation to C approach is that it cannot scale to reliably work for more complex programs. By "more complex" I mean larger program size, more functions, more complex patterns of recursion and/or the use of more functional programming features (including functional data structures). I don't mean higher computational complexity or more resource use.

The segfaults and runtime errors for "Juvix transpilation" are consequences of incorrectly generated code (compilation of partial application is not entirely correct) or stack overflows (when clang didn't figure out how to eliminate tail recursion).

The comparison with "Juvix transpilation" is not entirely fair for benchmarks that test the manipulation and calling of closures (e.g. fold). Juvix transpilation achieves good performance (when it doesn't segfault) at the expense of correctness: partial application is not compiled correctly and fixing this would require a fundamental change in closure representation.

The comparison with Haskell and OCaml compilers is not entirely fair, because the new Juvix C runtime does not perform garbage collection. With the GC overhead, I would expect the Juvix runtime results for native compilation of first-order programs to become a bit worse than the native OCaml versions. The GC overhead is particularly noticeable for the mergesort benchmark which creates many large intermediate lists. The memory usage of the Juvix runtime is much higher on this benchmark than the memory usage of OCaml or Haskell versions. The relatively small time difference between the OCaml native and bytecode versions of mergesort also indicates that GC accounts for a significant part of the running time.

Another small overhead will be introduced by bounds checking for integer operations. Currently, the Juvix C runtime operates on unboxed 31-bit (or 63-bit) integers without checking for integer overflow.

If we decide to default to transparent arbitrary-precision integers, then another small overhead will be introduced by the need to check the integer representation with each arithmetic operation.

Admittedly, the programs were deliberately written in a way to make high-level optimisations unnecessary, except specialisation for higher-order functions (mostly in ackermann and mapfun). This also explains the good performance of the OCaml native compiler which doesn't do much high-level optimisation.

In the "Juvix transpilation" and OCaml version of mergesort, to avoid stack overflow the merge function was written tail-recursively with accumulator reversal at the end. This is not necessary for the Juvix C runtime, because the stack is dynamically extended when needed.

As evidenced by the combinations benchmark, for non-tail-recursive direct calls our code performs worse than the code which uses the C / WebAssembly stack and function call mechanisms. However, in general it is impossible to directly use the C / WebAssembly stack and call mechanisms for a purely functional language. Since we dynamically allocate the stack segments when needed, stack overflow is impossible. This is convenient in an eager functional language. Otherwise, one needs to rewrite all functions operating on large data to use tail recursion. We pay for this convenience with a small overhead, which is the main reason for poorer performance on combinations where stack manipulation cost dominates.

Haskell's laziness seems to introduce more overhead than I expected. This would explain the comparatively better performance of the native OCaml compiler. The problem is particularly stark when Haskell's strictness analysis fails for some reason, as in the fibonacci benchmark. The second "Haskell" column with the "-XStrict" flag for GHC indicates the version of the benchmark compiled with strictness as the default.

The C versions of the programs were written to take advantage of C's imperative features, e.g., using arrays instead of lists, loops instead of recursion. No C versions are provided for some benchmarks designed to test specifically functional language features.

With the new Juvix runtime, the 32-bit WebAssembly version of mergesort is faster than the 64-bit native version because it needs roughly half as much memory (the word size is 4 bytes instead of 8). The difference is even starker between the WebAssembly and native versions of mergesort for "Juvix transpilation".