Skip to content

Latest commit

 

History

History
708 lines (615 loc) · 32.2 KB

performance.md

File metadata and controls

708 lines (615 loc) · 32.2 KB

Performance

Disclaimer

One of frawk's goals is to be efficient. The abundance of such claims notwithstanding, I've found it is very hard to precisely state the degree to which an entire programming language implementation is or is not efficient. I have no doubt that there are programs in frawk that are slower than equivalent programs in Rust or C, or even mawk or gawk (indeed, see the "Group By Key" benchmark). In some cases this will be due to bugs or differences in the quality of the language implementations, in others it will be due to bugs or differences in the quality of the programs themselves, while in still others it may be due to the fact that in some nontrivial sense one language allows you to write the program more efficiently than another.

I've found that it can be very hard to distinguish between these three categories of explanations when doing performance analysis. As such, I encourage everyone reading this document to draw at most modest conclusions based on the numbers below. What I do hope this document demonstrates are some of frawk's strengths when it comes to some common scripting tasks on CSV and TSV data. (Awk and frawk can process other data formats as well, but in my experience larger files are usually in CSV, TSV, or some similar standardized format).

Benchmark Setup

All benchmark numbers report the minimum wall time across 5 iterations per hardware configuration, along with the composite system and user times for that invocation as reported by the time command. A computation running with wall time of 2.5 seconds, and CPU time of 10.2s for user and 3.4s for system is reported as "2.5s (10.2s + 3.4s)". We also report throughput numbers, which are computed as wall time divided by input file size.

This doc includes measurements for both parallel and serial invocations of frawk, and it also provides numbers for frawk using its LLVM and its Cranelift backend, with all optimizations enabled. Note that frawk adaptively chooses the number of workers to launch for parallel invocations, so the ratio of CPU to wall time can vary across invocations. XSV supports parallelism but I noticed no performance benefit from this feature without first building an index of the underlying CSV file (a process which can take longer than the benchmark task itself without amortizing the cost over multiple runs). Similarly, I do not believe it is possible to parallelize the other programs using generic tools like gnu parallel without first partitioning the data-set into multiple sub-files.

-itsv vs -F'\t'

frawk allows you to specify the input format as TSV using the itsv option, but it also provides support for traditional Awk field separators using -F or by setting the FS variable. These two are not the same; they end up invoking two completely separate parsers under the hood. -itsv looks for escape sequences (like \t) and converts them to their corresponding characters. -F'\t', on the other hand, does a more naive "split by tabs" algorithm. -F'\t' tends to perform faster than -itsv because parsing is less complex, and field size computations are easier to perform. The latter fact makes it easier to optimize scripts that do not parse a large suffix of columns in the input (e.g. only using columns 3 and 5 of a 30-column file). This is a feature of most of the benchmarks in this document, and, as I understand it, it's a big part of why tsv-utils is a consistent leader in performance here.

Because tsv-utils, mawk and gawk are all invoked without the extra escaping behavior, the frawk invocations all use -F'\t' on TSV inputs. If you are curious about how frawk performs using itsv but don't want to run these benchmarks yourself, it's my experience that frawk -itsv achieves similar (albeit slightly higher) throughput to using icsv on an equivalent CSV file.

Test Data

These benchmarks run on CSV and TSV versions of two data-sets:

  • all_train.csv from the HEPMASS dataset in the UCI machine learning repository. This is a 7 million-row file and both CSV and TSV versions of the file are roughly 5.2GB, uncompressed.
  • TREE_GRM_ESTN.csv from The Forest Service (data-sets available here). This file is a little over 36 million rows; the CSV version is 8.9GB, and the TSV version is 7.9GB.

Hardware

This doc reports performance numbers from two machines, a new (late 2019) MacOS laptop, and an older (mid-2016, Broadwell-based) workstation running Ubuntu. They're called out as "MacOS" and "Linux" not because I take these numbers to be benchmarks of those operating systems, but as a shorthand for the whole configuration, software and hardware, used in those benchmarks. As the Tools section makes clear, the versions of the tools I used are all slightly different. Comparisons within the same configuration should be safe while comparisons across configurations are less so.

MacOS (Newer Hardware) This is a 16-inch Macbook Pro running MacOS Big Sur 11.2.1 with an 8-core i9 clocked at 2.3GHz, boosting up to 4.8GHz. One thing to keep in mind about newer Apple hardware is that the SSDs are very fast: read throughput can top 2.5GB/s. None of these benchmarks are IO-bound on this machine, although various portions of the input files might be cached by the OS.

Linux (Older Hardware) This is a dual-socket workstation with 2 8-core Xeon 2620v4 CPUs clocked at 2.2 GHz or so and boosting up to 2.9GHz on a single core, with 64GB of RAM. Reads are serviced from an NVMe SSD that is pretty fast, but not as fast as the SSD in the Mac. I do not think any of the benchmarks are IO-bound on this machine. The machine is running Ubuntu 18.04 with a 4.15 kernel.

While the results are varied from benchmark to benchmark, I tend to find that while frawk has good performance overall, it does noticeably better on the newer hardware running MacOS. The absolute difference in these numbers are probably due to having a newer CPU with a much higher boost clock, but I am less sure about the relative performance differences between frawk and tsv-utils. One contributing factor might be frawk's use of AVX2 driving the clock rate down to a greater degree on the Broadwell-E CPU in Linux than the more recent CPU on the Mac configuration. frawk's LLVM backend generally performs better than Cranelift (this is as expected, as Cranelift does not implement as many optimizations as of yet), but the performance is usually pretty close.

Tools

These benchmarks report values for a number of tools, each with slightly different intended use-cases and strengths. Not all of the tools are set up well to handle each task, but each makes an appearance in some subset of the benchmarks. All benchmarks include frawk, of course; in this case both the use_jemalloc and allow_avx2 features were enabled.

  • mawk is an Awk implementation due to Mike Brennan. It is focused on the "Awk book" semantics for the language, and on efficiency. I used mawk v1.3.4 on MacOS and Linux. The Linux copy is compiled from source with -march=native.
  • gawk is GNU's Awk. It is actively maintained and includes several extensions to the Awk language (such as multidimensional arrays). It is the default Awk on many Linux machines. I used gawk v5.0.1 on MacOS and gawk v5.1 on Linux. All gawk invocations use the -b flag, which disables multibyte character support. This is a bit unfair to xsv, frawk, and tsv-utilities, which all support UTF-8 input and (e.g.) matching regular expressions in a UTF-8-aware manner. However, the benchmarks here do not make any special use of UTF-8, and gawk performs substantially better with this option enabled (sometimes >2x).
  • xsv is a command-line utility for performing common computations and transformations on CSV data. I used xsv v0.13.0 on both platforms
  • tsv-utils is a set of command-line utilities offering a similar feature-set to xsv, but with a focus on TSV data. I used tsv-utils v2.1.1_linux-x86_64_ldc2 downloaded from the repo for MacOS and Linux.

There are plenty of other tools (including other Awks) I could put here (see the benchmarks referenced on xsv and tsv-utils' pages for more), but I chose a small set of tools that could execute some subset of the benchmarks efficiently.

We benchmark on CSV and TSV data. Some utilities are not equipped for all possible configurations, for example:

  • tsv-utils is explicitly optimized for TSV and not CSV. We only run it on the TSV variants of these datasets, but note that the bundled csv2tsv takes longer -though not that much longer- than many of the benchmark workloads (28.7s for TREE_GRM_ESTN and 14.6s for all_train, with the new version v2.1.1).
  • Benchmarks for the TREE_GRM_ESTN dataset are not run on CSV input with mawk or gawk, as it contains quoted fields and these utilities do not handle CSV escaping out of the box. We still run them on the TSV versions of these files using \t as a field separator.
  • tsv-utils and xsv are both limited to a specific menu of common functions. When those functions cannot easily express the computation in question, we leave them out.

Scripts and output for these benchmarks are here. They do not include the raw data or the various binaries in question, but they should be straightforward to adapt for a given set of installs on another machine. Outputs from the Linux configuration lives are in the ".2" files.

Ad-hoc number-crunching

Before we start benchmarking purpose-built tools, I want to emphasize the power that Awk provides for basic number-crunching tasks. Not only can you efficiently combine common tasks like filtering, group-by, and arithmetic, but you can build complex expressions as well. Suppose I wanted to take the scaled, weighted sum of the first column, and the maximum of column 4 and column 5, when column 8 had a particular value. In Awk this is a pretty simple script:

function max(x,y) { return x<y?y:x; }
"GS" == $8 { accum += (0.5*$1+0.5*max($4+0,$5+0))/1000.0 }
END { print accum; }

Note: the +0s ensure we are always performing a numeric comparison. In Awk this is unnecessary if all instances of column $4 and column $5 are numeric; in frawk this is required: max will perform a lexicographic comparison instead of a numeric one without the explicit conversion.

But I have to preprocess the data to run Awk on it, as Awk doesn't properly support quoted CSV fields by default. Supposing I didn't want to do that, the next thing I'd reach for is a short script in Python:

import csv
import sys

def parse_float(s):
    try:
        return float(s)
    except ValueError:
        return 0.0

accum = 0
with open(sys.argv[1], "r") as f:
    csvr = csv.reader(f)
    for row in csvr:
        if row[7] == "GS":
            f1 = parse_float(row[0])
            f4 = parse_float(row[3])
            f5 = parse_float(row[4])
            accum += ((0.5*f1 + 0.5*max(f4,f5)) / 1000.0)
print(accum)

If I was concerned about the running time for the script, I would probably write the script in Rust instead rather than spending time trying to optimize the Python. Here's one I came up with quickly using the csv crate:

extern crate csv;

use std::fs::File;
use std::io::BufReader;

fn get_field_default_0(record: &csv::StringRecord, field: usize) -> f64 {
    match record.get(field) {
        Some(x) => x.parse().unwrap_or(0.0),
        None => 0.0,
    }
}

fn main() -> std::io::Result<()> {
    let args: Vec<String> = std::env::args().collect();
    let filename = &args[1];
    let f = File::open(filename)?;
    let br = BufReader::new(f);
    let mut csvr = csv::Reader::from_reader(br);
    let mut accum = 0f64;
    for record in csvr.records() {
        let record = match record {
            Ok(record) => record,
            Err(err) => {
                eprintln!("failed to parse record: {}", err);
                std::process::abort()
            }
        };
        if record.get(7) == Some("GS") {
            let f1: f64 = get_field_default_0(&record, 0);
            let f4: f64 = get_field_default_0(&record, 3);
            let f5: f64 = get_field_default_0(&record, 4);
            let max = if f4 < f5 { f5 } else { f4 };
            accum += (0.5 * f1 + 0.5 * max) / 1000f64
        }.
    }
    println!("{}", accum);
    Ok(())
}

This takes 8.2 seconds to compile for a release build. Depending on the setting this either is or isn't important. We'll leave it out for now, but keep in mind that frawk's and python's runtimes include the time it takes to compile a program.

MacOS

Program Format Running Time Throughput
Python CSV 2m48.7s (2m47.4s + 1.3s) 53.02 MB/s
Rust CSV 25.9s (24.8s + 1.1s) 345.57 MB/s
frawk (cranelift) CSV 19.9s (18.8s + 1.1s) 450.13 MB/s
frawk (cranelift, parallel) CSV 4.9s (23.2s + 1.2s) 1827.84 MB/s
frawk (llvm) CSV 19.6s (18.5s + 1.1s) 457.12 MB/s
frawk (llvm, parallel) CSV 4.9s (22.9s + 1.2s) 1842.90 MB/s

Linux

Program Format Running Time Throughput
Python CSV 2m17.1s (2m15.3s + 1.8s) 65.23 MB/s
Rust CSV 29.9s (28.4s + 1.6s) 299.05 MB/s
frawk (cranelift) CSV 29.4s (27.3s + 2.2s) 304.02 MB/s
frawk (cranelift, parallel) CSV 9.8s (35.5s + 3.5s) 908.81 MB/s
frawk (llvm) CSV 28.4s (26.6s + 1.8s) 315.46 MB/s
frawk (llvm, parallel) CSV 9.9s (35.2s + 3.8s) 905.22 MB/s

frawk is a good deal faster than the other options, particularly on the newer hardware, or when run in parallel. Now, the Rust script could of course be optimized substantially (frawk is implemented in Rust, after all). But if you need to do exploratory, ad-hoc computations like this, a frawk script is probably going to be faster than the first few Rust programs you write.

With that said, for many tasks you do not need a full programming language, and there are purpose-built tools for computing the particular value or transformation on the dataset. The rest of these benchmarks compare frawk and Awk to some of those.

Sum two columns

Sum columns 4 and 5 from TREE_GRM_ESTN and columns 6 and 18 for all_train

Programs:

  • Awk: -F{,\t} {sum1 += ${6,4}; sum2 += ${18,5};} END { print sum1,sum2}
  • frawk: -i{c,t}sv {sum1 += ${6,4}; sum2 += ${18,5};} END { print sum1,sum2}
  • tsv-utils: tsv-summarize -H --sum {6,4},{18,5}

Awk was only run on the TSV version of TREE_GRM_ESTN, as it has quote-escaped columns, tsv-utils was only run on TSV versions of both files, and xsv was not included because there is no way to persuade it to just compute a sum.

As can be seen below, frawk in parallel mode was the fastest utility in terms of wall-time on MacOS, but it is slightly slower on the Linux hardware; on a per-core basis frawk was faster than mawk and gawk but slower than tsv-utils on both configurations.

MacOS

Program Format Running Time (TREE_GRM_ESTN) Throughput (TREE_GRM_ESTN) Running Time (all_train) Throughput (all_train)
mawk CSV NA NA 10.8s (9.8s + 1.1s) 477.98 MB/s
mawk TSV 42.0s (40.5s + 1.5s) 187.81 MB/s 10.8s (9.8s + 1.0s) 478.38 MB/s
gawk CSV NA NA 10.4s (9.5s + 0.9s) 500.19 MB/s
gawk TSV 14.0s (12.7s + 1.3s) 562.60 MB/s 10.4s (9.6s + 0.9s) 496.07 MB/s
tsv-utils TSV 5.6s (5.0s + 0.6s) 1397.24 MB/s 2.7s (2.2s + 0.4s) 1953.39 MB/s
frawk (llvm) CSV 18.0s (16.9s + 1.1s) 496.91 MB/s 3.6s (3.0s + 0.6s) 1436.06 MB/s
frawk (llvm) TSV 10.0s (9.0s + 1.0s) 790.27 MB/s 3.1s (2.5s + 0.6s) 1650.23 MB/s
frawk (llvm, parallel) CSV 4.9s (23.1s + 1.3s) 1822.99 MB/s 1.8s (4.7s + 0.7s) 2846.86 MB/s
frawk (llvm, parallel) TSV 3.4s (12.4s + 1.0s) 2332.73 MB/s 1.7s (4.3s + 0.7s) 3065.98 MB/s
frawk (cranelift) CSV 18.2s (17.0s + 1.1s) 492.75 MB/s 3.7s (3.0s + 0.6s) 1405.66 MB/s
frawk (cranelift) TSV 10.3s (9.3s + 1.0s) 763.36 MB/s 3.1s (2.5s + 0.6s) 1652.34 MB/s
frawk (cranelift, parallel) CSV 4.9s (23.3s + 1.3s) 1807.89 MB/s 1.8s (4.7s + 0.6s) 2896.22 MB/s
frawk (cranelift, parallel) TSV 3.4s (12.4s + 1.0s) 2350.80 MB/s 1.7s (4.4s + 0.7s) 3071.43 MB/s

Linux

Program Format Running Time (TREE_GRM_ESTN) Throughput (TREE_GRM_ESTN) Running Time (all_train) Throughput (all_train)
mawk CSV NA NA 10.4s (9.3s + 1.1s) 498.07 MB/s
mawk TSV 54.9s (53.0s + 1.9s) 143.74 MB/s 10.5s (9.1s + 1.4s) 491.41 MB/s
gawk CSV NA NA 11.5s (10.2s + 1.3s) 451.24 MB/s
gawk TSV 23.1s (21.7s + 1.4s) 341.23 MB/s 11.4s (10.4s + 0.9s) 456.21 MB/s
tsv-utils TSV 7.5s (6.3s + 1.2s) 1047.75 MB/s 3.6s (2.9s + 0.7s) 1430.11 MB/s
frawk (llvm) CSV 26.0s (24.3s + 1.8s) 343.79 MB/s 5.4s (4.3s + 1.0s) 960.04 MB/s
frawk (llvm) TSV 14.7s (13.2s + 1.5s) 536.59 MB/s 4.6s (3.8s + 0.9s) 1119.42 MB/s
frawk (llvm, parallel) CSV 9.9s (34.1s + 4.6s) 899.67 MB/s 4.0s (9.0s + 2.5s) 1307.69 MB/s
frawk (llvm, parallel) TSV 8.1s (21.0s + 3.0s) 978.75 MB/s 3.5s (8.3s + 2.0s) 1462.83 MB/s
frawk (cranelift) CSV 26.0s (24.4s + 1.6s) 344.47 MB/s 5.3s (4.4s + 0.9s) 973.02 MB/s
frawk (cranelift) TSV 15.3s (13.8s + 1.4s) 516.87 MB/s 4.8s (3.9s + 0.9s) 1085.85 MB/s
frawk (cranelift, parallel) CSV 9.8s (33.7s + 4.4s) 913.45 MB/s 4.2s (8.7s + 2.2s) 1233.25 MB/s
frawk (cranelift, parallel) TSV 8.4s (25.1s + 5.3s) 943.75 MB/s 3.8s (8.6s + 2.6s) 1355.61 MB/s

Statistics

Collect the sum, mean, minimum, maximum, minimum length, maximum length and standard deviation of a numeric column, collect the maximum, minimum, maximum length and minimum length of a string column.

As in the "sum" benchmark, frawk and Awk have the same programs, but with frawk using the icsv and itsv options. This benchmark is meant to mirror xsv's summarize command. The xsv invocation is xsv summarize -s5,6 [-d\t]. The Awk program is more involved:

function min(x,y) { return x<y?x:y; }
function max(x,y) { return x<y?y:x; }
function step_sum(x) { SUM += x; }
function step_stddev(x, k,  xa2) { xa2 = (x - A) * (x - A); A = A + (x-A)/k; Q=Q+((k-1)/k)*xa2; }
NR==1  { h2 = $5; h1 = $6; }
NR > 1 {
    # f2 is numeric, f1 is a string
    f2=$5+0; f2Len = length($5);
    f1=$6; f1Len = length($6);
    if (NR==2) {
        min1=max1=f1;
        min2=max2=f2;
        min1L=max1L=f1Len;
        min2L=max2L=f2Len;
    } else {
        min1 = min(min1, f1)
        min2 = min(min2, f2)
        min1L = min(min1L, f1Len)
        min2L = min(min2L, f2Len)
        max1 = max(max1, f1)
        max2 = max(max2, f2)
        max1L = max(max1L, f1Len)
        max2L = max(max2L, f2Len)
    }
    step_sum(f2);
    step_stddev(f2, NR-1);
}
END {
    N=NR-1 # account for header
    print "field","sum","min","max","min_length","max_length","mean","stddev"
    print h2,SUM,min2,max2,min2L,max2L,(SUM/N), sqrt(Q/(N-1))
    print h1,"NA",min1,max1,min1L,max1L,"NA","NA"
}

I had to look up the algorithm for computing a running sample standard deviation on Wikipedia. If this is all I wanted to compute, I'd probably go with xsv. This script will not parallelize automatically: to get the correct answer we will need to (a) manually aggregate the min and max values, as they are not sums and (b) apply an algorithm (again, from Wikipedia) to combine the subcomputations of the empirical variance. With that we get the formidable:

function min(x,y) { return x<y?x:y; }
function max(x,y) { return x<y?y:x; }
function step_stddev(x, k,  xa2) { xa2 = (x - A) * (x - A); A = A + (x-A)/k; Q=Q+((k-1)/k)*xa2; }
BEGIN {
    getline;
    h2 = $5; h1 = $6;
}
{
    # f2 is numeric, f1 is a string
    f2=$5+0; f2Len = length($5);
    f1=$6; f1Len = length($6);
    if (num_records++) {
        min1 = min(min1, f1)
        min2 = min(min2, f2)
        min1L = min(min1L, f1Len)
        min2L = min(min2L, f2Len)
        max1 = max(max1, f1)
        max2 = max(max2, f2)
        max1L = max(max1L, f1Len)
        max2L = max(max2L, f2Len)
    } else {
        min1=max1=f1;
        min2=max2=f2;
        min1L=max1L=f1Len;
        min2L=max2L=f2Len;
    }
    SUM += f2;
    step_stddev(f2, num_records);
}
PREPARE {
    min1M[PID]  = min1
    min2M[PID]  = min2
    min1lM[PID] = min1L
    min2lM[PID] = min2L

    max1M[PID]  = max1
    max2M[PID]  = max2
    max1lM[PID] = max1L
    max2lM[PID] = max2L

    records[PID] = num_records
    sums[PID] = SUM
    # (sub)sample variance.
    m2s[PID] = (Q/(num_records-1))

    if (PID != 1) {
        min1L = min2L = max1L = max2L = 0;
        min2 = max2 = 0
    }
}
END {
      for (pid in records) {
        min1 = min(min1, min1M[pid])
        min2 = min(min2, min2M[pid])
        min1L = min(min1L, min1lM[pid]);
        min2L = min(min2L, min2lM[pid]);

        max1 = max(max1, max1M[pid])
        max2 = max(max2, max2M[pid])
        max1L = max(max1L, max1lM[pid]);
        max2L = max(max2L, max2lM[pid]);
    }

    for (pid in records) {
        nb = records[pid]
        sb = sums[pid]
        mb = sb / nb
        m2b = m2s[pid]
        if (i == 1) {
            na = nb; ma = mb; sa = sb; m2a = m2b;
        } else {
            delta = mb - ma;
            ma = (sa + sb) / (na + nb)
            sa += sums[k]
            m2a = ((na*m2a) + (nb*m2b))/(na+nb) + ((delta*delta*na*nb)/((na+nb)*(na+nb)))
            na += nb
        }

    }

    stddev = sqrt(m2a)

    print "field","sum","min","max","min_length","max_length","mean","stddev"
    print h2,SUM,min2,max2,min2L,max2L,(SUM/num_records), stddev
    print h1,"NA",min1,max1,min1L,max1L,"NA","NA"
}

The numeric portion of this benchmark took a little over 10 seconds on MacOS, and 15 seconds on Linux using tsv-utils, but I am omitting it from the table because it does not support the given summary statistics on string fields. All numbers are reported for TREE_GRM_ESTN.

As can be seen below, frawk performed this task more quickly than any of the other benchmark programs, though the race is pretty close with xsv on the Linux desktop using a single core and CSV format. Note that frawk is noticeably slower using Cranelift when compared with LLVM in this benchmark; the other benchmark programs show performance of the two backends much closer together.

MacOS

Program Format Running Time Throughput
gawk TSV 1m13.3s (1m11.9s + 1.4s) 107.69 MB/s
mawk TSV 1m12.6s (1m11.1s + 1.5s) 108.73 MB/s
frawk (cranelift) CSV 26.4s (25.3s + 1.1s) 338.97 MB/s
frawk (cranelift) TSV 18.8s (17.8s + 1.0s) 420.73 MB/s
frawk (cranelift, parallel) CSV 5.3s (29.0s + 1.3s) 1686.23 MB/s
frawk (cranelift, parallel) TSV 3.5s (16.6s + 1.0s) 2223.62 MB/s
frawk (llvm) CSV 20.0s (18.9s + 1.1s) 447.41 MB/s
frawk (llvm) TSV 12.5s (11.5s + 1.0s) 631.23 MB/s
frawk (llvm, parallel) CSV 5.1s (23.6s + 1.2s) 1762.30 MB/s
frawk (llvm, parallel) TSV 3.6s (16.4s + 1.1s) 2178.20 MB/s
xsv CSV 34.7s (33.6s + 1.1s) 258.14 MB/s
xsv TSV 32.8s (31.8s + 0.9s) 240.78 MB/s

Linux

Program Format Running Time Throughput
gawk TSV 1m14.4s (1m12.8s + 1.6s) 106.05 MB/s
mawk TSV 1m23.3s (1m21.4s + 1.9s) 94.75 MB/s
frawk (cranelift) CSV 39.0s (37.1s + 2.0s) 229.09 MB/s
frawk (cranelift) TSV 28.3s (26.7s + 1.6s) 278.42 MB/s
frawk (cranelift, parallel) CSV 9.9s (42.2s + 4.8s) 900.58 MB/s
frawk (cranelift, parallel) TSV 8.1s (32.3s + 5.1s) 979.96 MB/s
frawk (llvm) CSV 29.1s (27.4s + 1.7s) 307.05 MB/s
frawk (llvm) TSV 18.7s (17.0s + 1.7s) 422.67 MB/s
frawk (llvm, parallel) CSV 10.0s (35.9s + 3.4s) 891.33 MB/s
frawk (llvm, parallel) TSV 7.8s (26.3s + 3.7s) 1010.97 MB/s
xsv CSV 34.2s (32.5s + 1.7s) 261.54 MB/s
xsv TSV 31.7s (30.3s + 1.4s) 248.89 MB/s

Select

Select 3 fields from the all_train dataset.

This is a task that all the benchmark programs support. The Awk script looks like BEGIN { OFS={",","\t"} { print $1,$2,$8 }. all_train does not have any quoted fields, so gawk and mawk can use them with the -F, option. As before, frawk uses the icsv and itsv options.

The xsv invocation looks like xsv select [-d'\t'] 1,8,19, and the tsv-utils invocation is tsv-select -f1,8,19. All output is written to /dev/null, so all times surely underestimate the true running time of such an operation.

Note: We report parallel numbers for frawk here, but running the select script in parallel mode does not preserve the original ordering of the rows in the input file. While the ordering of some data-sets are not meaningful, the comparison is not apples-to-apples.

frawk performs this task slower than tsv-utils and faster than the other benchmark programs. While the gap between frawk in parallel mode and tsv-utils is probably in the noise on MacOS (it's still pretty large for the Linux configuration), tsv-utils unquestionably performs better per core than frawk, while preserving the input's row ordering.

MacOS

Program Format Running Time Throughput
gawk CSV 7.9s (7.0s + 0.9s) 657.00 MB/s
gawk TSV 7.9s (7.1s + 0.9s) 654.26 MB/s
mawk CSV 8.4s (7.4s + 1.0s) 612.83 MB/s
mawk TSV 8.4s (7.4s + 1.0s) 616.85 MB/s
frawk (cranelift) CSV 3.8s (3.3s + 0.7s) 1358.81 MB/s
frawk (cranelift) TSV 3.3s (2.8s + 0.7s) 1569.70 MB/s
frawk (cranelift, parallel) CSV 1.9s (5.1s + 0.7s) 2745.72 MB/s
frawk (cranelift, parallel) TSV 1.7s (4.6s + 0.7s) 2972.70 MB/s
frawk (llvm) CSV 3.8s (3.2s + 0.7s) 1373.59 MB/s
frawk (llvm) TSV 3.2s (2.7s + 0.7s) 1598.78 MB/s
frawk (llvm, parallel) CSV 1.9s (5.1s + 0.7s) 2678.96 MB/s
frawk (llvm, parallel) TSV 1.8s (4.6s + 0.7s) 2930.64 MB/s
tsv-utils TSV 2.0s (1.6s + 0.4s) 2589.22 MB/s
xsv CSV 5.4s (4.8s + 0.6s) 961.28 MB/s
xsv TSV 5.3s (4.7s + 0.6s) 970.65 MB/s

Linux

Program Format Running Time Throughput
gawk CSV 9.4s (8.3s + 1.1s) 550.72 MB/s
gawk TSV 9.4s (8.4s + 1.0s) 551.01 MB/s
mawk CSV 8.3s (7.0s + 1.3s) 621.14 MB/s
mawk TSV 8.1s (7.0s + 1.1s) 638.68 MB/s
frawk (cranelift) CSV 5.5s (4.8s + 1.0s) 942.39 MB/s
frawk (cranelift) TSV 4.7s (4.0s + 1.0s) 1100.39 MB/s
frawk (cranelift, parallel) CSV 4.0s (9.5s + 2.3s) 1294.29 MB/s
frawk (cranelift, parallel) TSV 3.5s (8.5s + 2.0s) 1476.18 MB/s
frawk (llvm) CSV 5.3s (4.6s + 1.0s) 978.36 MB/s
frawk (llvm) TSV 4.6s (4.0s + 0.9s) 1115.08 MB/s
frawk (llvm, parallel) CSV 4.1s (9.4s + 2.5s) 1259.35 MB/s
frawk (llvm, parallel) TSV 3.7s (8.8s + 2.2s) 1406.04 MB/s
tsv-utils TSV 2.9s (2.0s + 0.9s) 1756.00 MB/s
xsv CSV 8.1s (7.2s + 0.9s) 642.73 MB/s
xsv TSV 8.0s (7.2s + 0.8s) 646.66 MB/s

Filter

Print out all records from the all_train dataset matching a simple numeric filter on two of the columns

The Awk script computing this filter is $4 > 0.000024 && $16 > 0.3. Because all_train has no quoted fields, mawk and gawk can both run using the -F, and -F\t options. As before, frawk runs the same script with the icsv and itsv options. The tsv-utils invocation is tsv-filter -H --gt 4:0.000025 --gt 16:0.3 ./all_train.tsv (taken from the same benchmark on the tsv-utils repo). xsv does not support numeric filters, so it was omitted from this benchmark.

Note: We report parallel numbers for frawk here, but the same caveat highlighted in the select benchmark holds here. The comparison is not apples-to-apples because frawk will not preserve the input's ordering of the rows when run in parallel mode.

Again, frawk is slower than tsv-utils and faster than everything else when running serially. In parallel, frawk is slightly faster than tsv-utils in terms of wall time on MacOS, and a still good deal slower in the Linux configuration.

MacOS

Program Format Running Time Throughput
gawk CSV 8.2s (7.3s + 0.9s) 634.61 MB/s
gawk TSV 8.2s (7.4s + 0.9s) 630.52 MB/s
mawk CSV 10.1s (9.1s + 1.0s) 511.25 MB/s
mawk TSV 10.0s (9.0s + 1.0s) 517.69 MB/s
frawk (cranelift) CSV 3.7s (3.2s + 0.7s) 1390.93 MB/s
frawk (cranelift) TSV 3.1s (2.6s + 0.7s) 1670.46 MB/s
frawk (cranelift, parallel) CSV 1.9s (5.2s + 0.8s) 2684.52 MB/s
frawk (cranelift, parallel) TSV 1.7s (4.7s + 0.7s) 2969.29 MB/s
frawk (llvm) CSV 3.6s (3.1s + 0.7s) 1422.65 MB/s
frawk (llvm) TSV 3.1s (2.6s + 0.7s) 1684.04 MB/s
frawk (llvm, parallel) CSV 1.9s (5.2s + 0.8s) 2670.67 MB/s
frawk (llvm, parallel) TSV 1.8s (4.8s + 0.8s) 2873.72 MB/s
tsv-utils TSV 2.3s (1.9s + 0.4s) 2213.95 MB/s

Linux

Program Format Running Time Throughput
gawk CSV 9.4s (8.4s + 1.0s) 551.01 MB/s
gawk TSV 9.4s (8.4s + 0.9s) 553.78 MB/s
mawk CSV 10.1s (8.9s + 1.2s) 512.67 MB/s
mawk TSV 10.2s (9.1s + 1.1s) 507.39 MB/s
frawk (cranelift) CSV 5.3s (4.7s + 1.0s) 980.39 MB/s
frawk (cranelift) TSV 4.8s (3.8s + 1.3s) 1083.13 MB/s
frawk (cranelift, parallel) CSV 4.3s (9.5s + 2.8s) 1204.29 MB/s
frawk (cranelift, parallel) TSV 3.6s (8.5s + 2.3s) 1432.09 MB/s
frawk (llvm) CSV 5.2s (4.5s + 1.0s) 1002.21 MB/s
frawk (llvm) TSV 4.5s (4.0s + 0.9s) 1146.69 MB/s
frawk (llvm, parallel) CSV 4.4s (9.6s + 2.6s) 1180.40 MB/s
frawk (llvm, parallel) TSV 3.9s (8.8s + 2.3s) 1337.75 MB/s
tsv-utils TSV 3.4s (2.7s + 0.7s) 1532.53 MB/s

Group By Key

Print the mean of field 2 grouped by the value in field 6 for TREE_GRM_ESTN

The tsv-utils command was tsv-summarize -H --group-by 6 --mean 2. The Awk script, with the usual settings for gawk, mawk, and frawk, reads:

BEGIN { getline; }
{ N[$6]++; SUM[$6]+=$2; }
END {
    OFS="\t"
    for (k in N) {
        print k, ((SUM[k])/N[k]);
    }
}

This is a workload where gawk and frawk are very close in terms of single-threaded performance. Neither program is nearly as fast as TSV-utils though. Even if parallel frawk is able to catch up on MacOS, my guess is that there are serious opportunitities for optimization of arrays here, though part of the slowdown may be due to tsv-utils's superious handling of 'wide rows' where we only read a small number of columns.

MacOS

Program Format Running Time Throughput
gawk TSV 14.8s (13.5s + 1.3s) 534.48 MB/s
mawk TSV 42.1s (40.6s + 1.5s) 187.61 MB/s
frawk (cranelift) CSV 22.1s (21.0s + 1.1s) 405.69 MB/s
frawk (cranelift) TSV 15.1s (14.1s + 1.0s) 523.84 MB/s
frawk (cranelift, parallel) CSV 5.3s (29.5s + 1.3s) 1701.62 MB/s
frawk (cranelift, parallel) TSV 3.5s (16.6s + 1.0s) 2225.50 MB/s
frawk (llvm) CSV 21.8s (20.7s + 1.1s) 410.47 MB/s
frawk (llvm) TSV 14.8s (13.8s + 1.0s) 534.66 MB/s
frawk (llvm, parallel) CSV 5.2s (29.0s + 1.3s) 1720.27 MB/s
frawk (llvm, parallel) TSV 3.6s (16.5s + 1.0s) 2221.74 MB/s
tsv-utils TSV 4.9s (4.3s + 0.6s) 1614.16 MB/s

Linux

Program Format Running Time Throughput
gawk TSV 23.5s (21.9s + 1.6s) 335.70 MB/s
mawk TSV 50.6s (48.5s + 2.1s) 155.86 MB/s
frawk (cranelift) CSV 32.4s (30.5s + 1.9s) 276.18 MB/s
frawk (cranelift) TSV 22.9s (21.1s + 1.8s) 343.91 MB/s
frawk (cranelift, parallel) CSV 10.2s (41.5s + 5.5s) 881.24 MB/s
frawk (cranelift, parallel) TSV 7.9s (32.0s + 5.0s) 1001.35 MB/s
frawk (llvm) CSV 31.8s (30.1s + 1.7s) 281.08 MB/s
frawk (llvm) TSV 22.1s (20.5s + 1.6s) 356.80 MB/s
frawk (llvm, parallel) CSV 10.4s (42.5s + 6.0s) 859.72 MB/s
frawk (llvm, parallel) TSV 7.5s (27.3s + 2.4s) 1054.19 MB/s
tsv-utils TSV 5.9s (4.6s + 1.3s) 1333.27 MB/s