Pcodec

[Figure: bar charts showing better compression for Pco than zstd parquet or blosc]

Pcodec (or Pco) losslessly compresses and decompresses numerical sequences with a high compression ratio and moderately fast speed.

Use cases include:

columnar data
long-term time series data
serving numerical data to web clients
low-bandwidth communication

Data types: u16, u32, u64, i16, i32, i64, f16, f32, f64

Get Started

Use the CLI (also supports benchmarking)

Use the Rust API

Use the Python API

How is Pco so much better than alternatives?

Pco is designed specifically for numerical data, whereas alternatives rely on general-purpose (LZ) compressors that target string or binary data. Pco uses a holistic, 3-step approach:

modes. Pco identifies an approximate structure of the numbers called a mode and then uses it to split numbers into "latents". As an example, if all numbers are approximately multiples of 777, int mult mode splits each number x into latent variables l_0 and l_1 such that x = 777 * l_0 + l_1. Most natural data uses classic mode, which simply matches x = l_0.
delta encoding. Pco identifies whether certain latent variables would be better compressed as deltas between consecutive elements (or deltas of deltas, or deltas with lookback). If so, it takes differences.
binning. This is the heart and most novel part of Pco. Pco represents each (delta-encoded) latent variable as an approximate, entropy-coded bin paired with an exact offset into that bin. This nears the Shannon entropy of any smooth distribution very efficiently.
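The first two steps can be illustrated with a minimal sketch in plain Python. It is not Pco's implementation or API: the helper names (`split_int_mult`, `join_int_mult`, `delta_encode`, `delta_decode`) are hypothetical, the base 777 is taken from the example above, and keeping the first value inline in the delta stream is an assumption made for simplicity.

```python
def split_int_mult(xs, base):
    """Split each x into latents (l_0, l_1) such that x == base * l_0 + l_1."""
    l0 = [x // base for x in xs]
    l1 = [x % base for x in xs]
    return l0, l1


def join_int_mult(l0, l1, base):
    """Invert split_int_mult."""
    return [base * q + r for q, r in zip(l0, l1)]


def delta_encode(xs):
    """Keep the first value, then store differences between consecutive elements."""
    if not xs:
        return []
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]


def delta_decode(ds):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in ds:
        total += d
        out.append(total)
    return out


# Numbers that are approximately multiples of 777 (illustrative data only).
nums = [777 * k + (k % 3) for k in range(100, 110)]

l0, l1 = split_int_mult(nums, 777)  # l0 grows steadily, l1 stays tiny
l0_deltas = delta_encode(l0)        # after the first value, every delta is 1

assert join_int_mult(l0, l1, 777) == nums
assert delta_decode(l0_deltas) == l0
```

After the split, `l_0` is a slowly increasing sequence that delta encoding turns into near-constant values, while `l_1` is already confined to a narrow range. Both are far easier to model than the original numbers.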

These 3 steps cohesively capture most entropy of numerical data without waste.
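To make the binning idea more concrete, here is a rough sketch in plain Python. It is an illustration under stated assumptions, not Pco's algorithm: bins are chosen at evenly spaced quantiles (Pco's actual bin selection is more sophisticated), the entropy coder is replaced by an idealized cost of `-log2(p)` bits per bin, and all function names are hypothetical.

```python
import math
import random
from bisect import bisect_right
from collections import Counter


def make_bins(nums, n_bins):
    """Choose bin lower bounds at evenly spaced quantiles of the data."""
    ordered = sorted(nums)
    return sorted({ordered[(i * len(ordered)) // n_bins] for i in range(n_bins)})


def encode(x, lowers):
    """Return (bin index, exact offset from that bin's lower bound)."""
    idx = bisect_right(lowers, x) - 1
    return idx, x - lowers[idx]


def decode(idx, offset, lowers):
    """Recover the exact number from its bin and offset."""
    return lowers[idx] + offset


def estimated_bits_per_number(nums, lowers):
    """Idealized cost: -log2(bin frequency) for the bin plus fixed-width offset bits."""
    edges = lowers + [max(nums) + 1]
    counts = Counter(encode(x, lowers)[0] for x in nums)
    total_bits = 0.0
    for idx, count in counts.items():
        width = edges[idx + 1] - edges[idx]
        offset_bits = math.ceil(math.log2(width))
        bin_bits = -math.log2(count / len(nums))
        total_bits += count * (bin_bits + offset_bits)
    return total_bits / len(nums)


# Synthetic smooth data: integers drawn from a bell-shaped distribution.
rng = random.Random(0)
nums = [round(rng.gauss(1000, 50)) for _ in range(5000)]

lowers = make_bins(nums, 16)
pairs = [encode(x, lowers) for x in nums]
restored = [decode(idx, off, lowers) for idx, off in pairs]
bits = estimated_bits_per_number(nums, lowers)

assert restored == nums  # the bin plus the offset is lossless
print(f"about {bits:.2f} bits per number instead of 32")
```

Narrow bins land where the data is dense, so common values need few offset bits, while rare values in wide tail bins pay more. That trade-off is what lets this kind of scheme approach the entropy of a smooth distribution.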

In contrast, LZ compressors are only effective for patterns like repeating exact sequences of numbers. Such patterns constitute just a small fraction of most numerical data's entropy.
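A small experiment illustrates the point. This sketch uses `zlib` (DEFLATE, an LZ77-based compressor from the Python standard library) as a stand-in for the LZ family; it is not one of the compressors benchmarked above, and the data is synthetic.

```python
import random
import struct
import zlib


def zlib_ratio(nums):
    """Compression ratio of zlib on the numbers stored as little-endian int32."""
    raw = struct.pack(f"<{len(nums)}i", *nums)
    return len(raw) / len(zlib.compress(raw, 9))


rng = random.Random(0)

# An exact 50-number sequence repeated 200 times: ideal for LZ matching.
pattern = [rng.randrange(1_000_000) for _ in range(50)]
repeating = pattern * 200

# Smooth numerical data: values cluster near 1000 but rarely repeat as sequences.
smooth = [round(rng.gauss(1000, 50)) for _ in range(10_000)]

print(f"repeating sequence: {zlib_ratio(repeating):.1f}x")
print(f"smooth distribution: {zlib_ratio(smooth):.1f}x")
```

The repeating data compresses far better because LZ can replace each repeat with a back-reference. The smooth data offers few such matches, so most of its structure (a narrow distribution of values) goes unexploited.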

Usage Details

Wrapped or Standalone

Pco is designed to embed into wrapping formats. It provides a powerful wrapped API with the building blocks to interleave it with the wrapping format. This is useful if the wrapping format needs to support things like nullability, multiple columns, random access, or seeking.

The standalone format is a minimal implementation of a wrapped format. It supports only batched decompression, with no other niceties. It is mainly recommended for quick proofs of concept and benchmarking.

Granularity

Pco has a hierarchy of multiple batches per page, multiple pages per chunk, and multiple chunks per file.

level	unit of	size for good compression
chunk	compression	>10k numbers
page	interleaving w/ wrapping format	>1k numbers
batch	decompression	256 numbers (fixed)
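As a quick arithmetic aid, the sketch below shows how one chunk might break down into pages and batches. Only the fixed batch size of 256 comes from the table; the page length of 10,000, the helper name `plan_chunk`, and the assumption that the last batch of a page may be shorter than 256 are illustrative choices.

```python
BATCH_SIZE = 256  # fixed batch size from the table above


def plan_chunk(n_numbers, page_len):
    """Split one chunk of n_numbers into pages and count the batches in each page.

    Returns a list of (numbers_in_page, batches_in_page) tuples.
    """
    plan = []
    remaining = n_numbers
    while remaining > 0:
        in_page = min(page_len, remaining)
        n_batches = -(-in_page // BATCH_SIZE)  # ceiling division
        plan.append((in_page, n_batches))
        remaining -= in_page
    return plan


# A chunk of 25,000 numbers with a hypothetical page length of 10,000.
plan = plan_chunk(25_000, 10_000)
print(plan)
```

Here the chunk is large enough for good compression (>10k numbers), and each page comfortably exceeds the 1k guideline while still giving the wrapping format three places to interleave its own data.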

Mistakes to Avoid

You may get disappointing results from Pco if your data in a single chunk

combines semantically different sequences, or
is inherently 2D or higher.

Example: the NYC taxi dataset has f64 columns for fare and trip_miles. Suppose we assign these as fare[0...n] and trip_miles[0...n] respectively, where n=50,000.

separate chunk for each column => good compression
single chunk fare[0], ... fare[n-1], trip_miles[0], ... trip_miles[n-1] => bad compression
single chunk fare[0], trip_miles[0], ... fare[n-1], trip_miles[n-1] => bad compression
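The three layouts can be written out as follows. This sketch uses synthetic integer stand-ins for the two columns (not the actual NYC taxi data) and measures only one order-independent effect: when a single chunk mixes two different value distributions, the entropy of the mixture exceeds the average entropy of the separate columns, by up to one bit per number for two equally sized columns. Order-dependent steps such as delta encoding can suffer further, which this estimate does not capture.

```python
import math
import random
from collections import Counter


def entropy_bits(values):
    """Empirical Shannon entropy in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
n = 50_000

# Synthetic stand-ins: fares in cents and trip distances in tenths of a mile.
fare = [round(rng.gauss(1500, 500)) for _ in range(n)]
trip_miles = [round(rng.gauss(30, 10)) for _ in range(n)]

# Layout 1: a separate chunk per column.
separate = (entropy_bits(fare) + entropy_bits(trip_miles)) / 2

# Layout 2: one chunk, columns concatenated.
concatenated = fare + trip_miles

# Layout 3: one chunk, columns interleaved row by row.
interleaved = [v for row in zip(fare, trip_miles) for v in row]

print(f"separate chunks:    {separate:.2f} bits per number")
print(f"concatenated chunk: {entropy_bits(concatenated):.2f} bits per number")
print(f"interleaved chunk:  {entropy_bits(interleaved):.2f} bits per number")
```

Both single-chunk layouts contain the same multiset of values, so this estimate treats them identically. Either way, a compressor that models one distribution per chunk has to spend extra bits distinguishing fares from distances.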

Extra

Docs

benchmarks: see the results

format specification

terminology

Quantile Compression: Pcodec's predecessor

contributing guide

Community

join the Discord
