Skip to content

A Postgres extension in C to compute statistics on arrays of numbers

License

Notifications You must be signed in to change notification settings

pjungwir/aggs_for_arrays

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

60 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

aggs_for_arrays

This Postgres extension provides various functions for operating on arrays, for instance taking the histogram of an array of numbers.

These functions are useful because if you have a lot values you want to aggregate, queries that fetch each value from a separate row can have poor performance. Storing all the values in a single row as a Postgres array can drastically improve query performance. For instance, computing a 1000-bucket histogram on one million float values stored in separate rows took 12 seconds in a simple benchmark, compared to 27 milliseconds with the array_to_hist function.

Using arrays in this way is a bit like a poor-man's column-store database: it lets you keep all the values for one attribute in one place. (It's also a bit like how in R and Pandas you often see parallel arrays rather than arrays of objects.) To simplify a little, imagine pulling one million floats off disk in a single 8MB chunk instead of asking the drive for one million separate reads. I wouldn't use this pattern if your arrays get updated a lot, lest you take a hit on writes (At least test first!), but if they are pretty stable then using arrays can greatly speed up reads.

With such an approach you could still use SQL or PLPGSQL, but these functions outperform such code, because the Postgres C API lets you skip a lot of the work for interfacing at those higher levels. For instance, the same benchmark gave 398ms for a SQL solution and 12 seconds for a plpgsql solution. We show further benchmark results below.

Note that despite the name, these functions are not true aggregate functions (summarizing multiple rows). Rather they do aggregate-like calculations on a single input array. If you want actual aggregates that take multiple input arrays, then you might be looking for my other extension, aggs_for_vecs. If this extension takes a column-store approach to your data, that one takes a row-store approach.

Installing

This package installs like any Postgres extension. First say:

make && sudo make install

You will need to have pg_config in your path, but normally that is already the case. You can check with which pg_config.

Then in the database of your choice say:

CREATE EXTENSION aggs_for_arrays;

Functions

The available functions are described below. In general, these functions accept arrays of any integer or floating-point type, namely SMALLINT, INTEGER, BIGINT, REAL, or DOUBLE PRECISION (aka FLOAT). The return value will either be the same type (e.g. for a minimum), a FLOAT (e.g. for a mean), or an INTEGER type (e.g. for histogram bucket counts). If a function can take any numeric type, its types are shown as T.

INTEGER[] array_to_hist(values T[], bucket_start T, bucket_width T, bucket_count INTEGER)

Returns the bucket count based on the values and bucket characteristics you request.

INTEGER[] array_to_hist_2d(x_values T[], y_values T[], x_bucket_start T, y_bucket_start T, x_bucket_width T, y_bucket_width T, x_bucket_count INTEGER, y_bucket_count INTEGER)

Returns the bucket count as a 2-D array based on the values and bucket characteristics you request. The data arrays x_values and y_values must be the same length. We compare each array's first element and plot it, then their second element, etc. If either x_values or y_values is NULL, the whole result is NULL. If either contains a NULL, then that position isn't plotted.

FLOAT array_to_mean(values T[])

Returns the mean of all the values in the array.

FLOAT array_to_median(values T[])

Returns the median. Does not require a pre-sorted input. If there are an even number of values, returns the mean of the two middle values.

FLOAT sorted_array_to_median(values T[])

Just like array_to_median, but assumes values is already sorted.

FLOAT array_to_mode(values T[])

Returns the mode. Does not require a pre-sorted input. If there are several values tied for most common, returns their mean.

FLOAT sorted_array_to_mode(values T[])

Just like array_to_mode, but assumes values is already sorted.

FLOAT array_to_percentile(values T[], percentile FLOAT)

Returns the percentile you request, where percentile is a number from 0 to 1 inclusive. Asking for 0 will always give the minimum, 1 for maximum, and 0.5 the median. If you ask for a percentile that lands between two data points, we return a linear interpolation between them.

FLOAT sorted_array_to_percentile(values T[], percentile FLOAT)

Just like array_to_percentile, but assumes values is already sorted.

FLOAT[] array_to_percentiles(values T[], percentiles FLOAT[])

Just like array_to_percentile, but you can pass several percentiles and get the result for each in a single call.

FLOAT[] sorted_array_to_percentiles(values T[], percentiles FLOAT[])

Just like array_to_percentiles, but assumes values is already sorted.

T array_to_max(values T[])

Returns the greatest value in the array.

T array_to_min(values T[])

Returns the least value in the array.

T[] array_to_min_max(values T[])

Returns a tuple with the min in position 1 and the max in position 2.

FLOAT array_to_skewness(values T[])

Computes the skewness of the given values.

FLOAT array_to_kurtosis(values T[])

Computes the kurtosis of the given values.

Benchmarks

Assume you have two tables:

CREATE TABLE samples (
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  value FLOAT NOT NULL
);

CREATE TABLE sample_groups {
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  values FLOAT[] NOT NULL
};

These tables store the same information, but samples stores each sample in a separate row, and sample_groups stores a whole group in just one row.

You can run bench.sh to test the performance of various approaches:

  • SQL on samples.
  • SQL on sample_groups.
  • PLPGSQL on sample_groups.
  • The aggs_for_arrays function on sample_groups.

The sorted_array_to_* methods use sorted_samples and sorted_sample_groups instead.

function SQL row-based SQL array-based PLPGSQL array-based aggs_for_arrays
array_to_hist 12218.1 ms 398.235 ms 12310.800 ms 26.936 ms
array_to_mean 10630.0 ms 121.677 ms 390.983 ms 25.226 ms
array_to_median 33587.0 ms 1163.070 ms 1258.160 ms 47.996 ms
sorted_array_to_median 23239.5 ms 30.107 ms 41.225 ms 14.835 ms
array_to_mode 13724.1 ms 1505.310 ms 1552.610 ms 201.943 ms
sorted_array_to_mode 13195.2 ms 1474.130 ms 1577.770 ms 45.171 ms
array_to_percentile 24218.2 ms 2591.240 ms 1698.570 ms 179.916 ms
sorted_array_to_percentile 24305.5 ms 2102.520 ms 1204.140 ms 21.947 ms
array_to_percentiles 32367.0 ms 10735.300 ms 3608.800 ms 188.752 ms
sorted_array_to_percentiles 32294.3 ms 10153.300 ms 3120.830 ms 22.227 ms
array_to_max 10613.2 ms 115.094 ms 398.791 ms 17.321 ms
array_to_min 10600.5 ms 113.859 ms 400.926 ms 17.204 ms
array_to_min_max 10727.9 ms 169.226 ms 824.539 ms 23.922 ms
array_to_skewness 22267.2 ms 802.463 ms 1077.630 ms 120.925 ms
array_to_kurtosis 22253.1 ms 806.296 ms 1075.960 ms 112.210 ms

Development

These tests follow the PGXS and pg_regress framework used for Postgres extensions, including Postgres's own contrib package. To run the tests, first install the extension somewhere then say make installcheck. You can use standard libpq envvars to control the database connection, e.g. PGPORT=5436 make installcheck.

About

A Postgres extension in C to compute statistics on arrays of numbers

Resources

License

Stars

Watchers

Forks

Packages

No packages published