PBC - High-Ratio Compression for Machine-Generated Data

PBC (Pattern-Based Compression) is a fast lossless compression algorithm that specifically targets patterns in machine-generated data, achieving Pareto-optimality between compression ratio and speed in most cases. Unlike traditional block-based methods, PBC compresses data on a per-record basis, which facilitates rapid random access. The technical details are described in the paper "High-Ratio Compression for Machine-Generated Data", which has been accepted by SIGMOD 2024 and is available on arXiv: https://arxiv.org/pdf/2311.13947.pdf
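
To illustrate why per-record compression allows random access, here is a minimal Python sketch. It does not use PBC: it substitutes zlib with a preset dictionary as a stand-in for trained patterns, and the records and the dictionary contents are hypothetical. The point is only that each record is compressed independently, so any single record can be decompressed without touching its neighbours.

```python
import zlib

# Hypothetical machine-generated records (not taken from the PBC datasets).
records = [
    b"2024-01-01 12:00:01 INFO worker 17 started in 35 ms",
    b"2024-01-01 12:00:02 INFO worker 18 started in 41 ms",
    b"2024-01-01 12:00:03 WARN worker 17 retry 2 of 5",
    b"2024-01-01 12:00:04 INFO worker 19 started in 38 ms",
]

# Assumption: a shared dictionary plays the role that trained patterns
# play in PBC. It is built once and reused for every record.
shared_dict = b"2024-01-01 12:00:00 INFO worker started in ms WARN retry of "


def compress_record(record: bytes) -> bytes:
    """Compress one record on its own, using only the shared dictionary."""
    compressor = zlib.compressobj(zdict=shared_dict)
    return compressor.compress(record) + compressor.flush()


def decompress_record(blob: bytes) -> bytes:
    """Decompress one record without needing any other record."""
    decompressor = zlib.decompressobj(zdict=shared_dict)
    return decompressor.decompress(blob) + decompressor.flush()


# Every record is stored as an independent compressed blob.
compressed = [compress_record(r) for r in records]

# Random access: fetch only the third record.
third = decompress_record(compressed[2])
```

A block-based compressor would instead have to decompress the whole block containing the third record before it could return it.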

Benchmarks

For reference, several fast compression algorithms were tested and compared on a machine with an Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz and 755GB of main memory, running Linux kernel version 4.19.91. The Android and Apache datasets are logs, and the urls dataset contains a variety of URLs. Only a subset of the experiments is shown here; for the comprehensive set of benchmarks and a detailed analysis of the results, please refer to the paper "High-Ratio Compression for Machine-Generated Data".

Dataset Name	Compressor Name	Compression Ratio	Compression Speed (MB/s)	Decompression Speed (MB/s)
Android	FSST	0.576	261.78	2096.87
Android	LZ4	0.560	30.92	1282.84
Android	Zstd	0.543	18.62	284.01
Android	PBC	0.347	60.40	3231.56
Android	PBC_FSST	0.245	53.13	1580.05
Apache	FSST	0.322	320.72	3039.89
Apache	LZ4	0.349	31.31	1773.38
Apache	Zstd	0.411	12.07	343.56
Apache	PBC	0.151	48.85	3140.39
Apache	PBC_FSST	0.104	43.32	1909.66
urls	FSST	0.413	195.89	1807.98
urls	LZ4	0.456	22.15	1247.63
urls	Zstd	0.611	11.35	158.91
urls	PBC	0.299	63.67	2029.16
urls	PBC_FSST	0.248	55.11	1043.43
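
As a reading aid for the table above, the following Python sketch works through the arithmetic for a few rows. It assumes the ratio column is compressed size divided by original size, which is consistent with all values being below 1 but is not stated here; the function names are made up for this illustration.

```python
# Compression ratios copied from the table above (Apache dataset).
ratios = {
    "Zstd": 0.411,
    "PBC": 0.151,
    "PBC_FSST": 0.104,
}


def space_saving(ratio: float) -> float:
    """Fraction of the original size that is saved.

    Assumption: ratio = compressed size / original size.
    """
    return 1.0 - ratio


def relative_size(ratio_a: float, ratio_b: float) -> float:
    """Size of output A as a fraction of output B for the same input."""
    return ratio_a / ratio_b


for name, ratio in ratios.items():
    print(f"{name}: saves {space_saving(ratio):.1%} of the original size")

print(
    "PBC_FSST output relative to Zstd output: "
    f"{relative_size(ratios['PBC_FSST'], ratios['Zstd']):.3f}"
)
```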

PBC can apply a secondary compression encoder to further compress the data that it has already compressed. The currently supported encoders are FSE, FSST, and ZSTD. Depending on the encoder used, the resulting methods are referred to as PBC_FSE, PBC_FSST, and PBC_ZSTD, respectively; PBC_ONLY uses PBC alone, with no secondary encoder.
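
The two-stage idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not PBC's actual encoding: the patterns, the one-byte pattern id, the "\x00" field separator, and the `secondary` flag are all hypothetical, and zlib merely stands in for FSE, FSST, or ZSTD. Stage one replaces a record with a pattern id plus the variable fields; stage two optionally compresses those fields again.

```python
import re
import zlib

# Hypothetical patterns: literal text with "*" marking variable fields.
PATTERNS = [
    "worker * started in * ms",
    "GET /api/users/* HTTP/1.1 status=*",
]
NO_PATTERN = 255  # hypothetical id meaning "no pattern matched"


def _regex(pattern: str) -> "re.Pattern[str]":
    """Turn a "*" pattern into a regex that captures each variable field."""
    literals = [re.escape(part) for part in pattern.split("*")]
    return re.compile("(.*?)".join(literals), re.DOTALL)


def encode(record: str, secondary: bool = True) -> bytes:
    """Stage 1: pattern id + variable fields. Stage 2: optional zlib pass."""
    for pattern_id, pattern in enumerate(PATTERNS):
        match = _regex(pattern).fullmatch(record)
        if match:
            # Assumes fields never contain the "\x00" separator.
            residual = "\x00".join(match.groups()).encode()
            break
    else:
        pattern_id, residual = NO_PATTERN, record.encode()
    if secondary:
        residual = zlib.compress(residual)  # stand-in for FSE/FSST/ZSTD
    return bytes([pattern_id]) + residual


def decode(blob: bytes, secondary: bool = True) -> str:
    """Reverse encode(): undo stage 2, then re-insert fields into the pattern."""
    pattern_id, residual = blob[0], blob[1:]
    if secondary:
        residual = zlib.decompress(residual)
    text = residual.decode()
    if pattern_id == NO_PATTERN:
        return text
    literals = PATTERNS[pattern_id].split("*")
    fields = text.split("\x00")
    out = literals[0]
    for field, literal in zip(fields, literals[1:]):
        out += field + literal
    return out


record = "worker 17 started in 35 ms"
stage_one_only = encode(record, secondary=False)  # analogous to PBC_ONLY
two_stage = encode(record, secondary=True)        # analogous to PBC_ZSTD etc.
```

On very short residuals a general-purpose second stage can add overhead rather than remove it, so this sketch makes no claim about sizes; it only shows how the stages compose and invert.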

Quickstart

This section shows how to build and use PBC. Example code can be found in the example directory.

Requirements:

We recommend using Ubuntu 22.04 as the base environment. You can pull the image using:

docker pull ubuntu:22.04

Once inside the container, run the following commands to install the necessary dependencies:

apt-get update
apt-get install -y build-essential cmake clang lldb llvm-dev lld libboost-all-dev

These packages are essential for building the project.

Then follow these steps to build the project:

Build and install PBC

cd pbc
bash ./build.sh -r       # build pbc; the default is a Debug build, -r selects a Release build
bash ./run_tests.sh      # run pbc tests
./install_pbc.sh    # install pbc library and header file

Build with pbc library

cd example
clang++ pbc_train_pattern.cc -L/usr/local/lib/pbc -I/usr/local/include -lpbc -lpbc_fse -lpbc_fsst -lzstd -lhs -lpthread -o pbc_train_pattern
clang++ pbc_compress.cc -L/usr/local/lib/pbc -I/usr/local/include -lpbc -lpbc_fse -lpbc_fsst -lzstd -lhs -lpthread -o pbc_compress

Example code (example/pbc_train_pattern.cc, example/pbc_compress.cc) is provided.

Run example

cp ../dataset/Apache Apache
./pbc_train_pattern Apache Apache.pat
./pbc_compress Apache Apache.pat

Please run all commands with sudo if you are not already running as root.

Terminal tool

The terminal tool supports four operations: --train-pattern, --test-compress, --compress, and --decompress.

Train the pattern

./bin/pbc --train-pattern -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_fsst --pattern-size 50 --train-data-number 1000 --train-thread-num 64

Test the compression ratio

./bin/pbc --test-compress -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_fsst

Compress a file

./bin/pbc --compress -i dataset/Apache -p dataset/Apache.pat -o dataset/Apache.compress

Decompress a file

./bin/pbc --decompress -i dataset/Apache.compress -p dataset/Apache.pat -o dataset/Apache.origin
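
The four commands above can be chained into a round-trip check. The Python sketch below is one possible way to do that; the helper names are hypothetical, and it assumes the binary lives at ./bin/pbc and is run from the repository root. It uses only the flags shown in this section.

```python
import filecmp
import subprocess

PBC = "./bin/pbc"  # assumption: path of the built binary, as used above


def train_cmd(data: str, pattern: str, method: str = "pbc_fsst") -> list:
    """Argument list for training a pattern file from a data file."""
    return [PBC, "--train-pattern", "-i", data, "-p", pattern,
            "--compress-method", method]


def compress_cmd(data: str, pattern: str, out: str) -> list:
    """Argument list for compressing a file with a trained pattern file."""
    return [PBC, "--compress", "-i", data, "-p", pattern, "-o", out]


def decompress_cmd(compressed: str, pattern: str, out: str) -> list:
    """Argument list for restoring a compressed file."""
    return [PBC, "--decompress", "-i", compressed, "-p", pattern, "-o", out]


def round_trip(data: str, pattern: str) -> bool:
    """Train, compress, decompress, then compare the result byte for byte."""
    compressed = data + ".compress"
    restored = data + ".origin"
    steps = (
        train_cmd(data, pattern),
        compress_cmd(data, pattern, compressed),
        decompress_cmd(compressed, pattern, restored),
    )
    for cmd in steps:
        subprocess.run(cmd, check=True)  # raises if a step exits non-zero
    return filecmp.cmp(data, restored, shallow=False)
```

Calling round_trip("dataset/Apache", "dataset/Apache.pat") would run the same sequence as the commands above and should return True for a lossless round trip.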

Detailed parameter usage:

Usage: pbc [OPTIONS] [arg [arg ...]]
  --help             Output this help and exit.
  --train-pattern -i <inputFile> -p <patternFile> [--compress-method <pbc_only/pbc_fse/pbc_fsst/pbc_zstd>] [--pattern-size <pattern_size>] [--train-data-number <train_data_number>] [--train-thread-num <train_thread_num>] [--varchar].
  --test-compress -i <inputFile> -p <patternFile> [--compress-method <pbc_only/pbc_fse/pbc_fsst/pbc_zstd>] [--varchar].
  -c/--compress -i <inputFile> -p <patternFile> [-o <outputFile>].
  -d/--decompress -i <inputFile> -p <patternFile> [-o <outputFile>].
  -i <inputFile>           Input File, train-pattern/test-compress(not default), compress/decompress(default: stdin).
  -p <patternFile>         Pattern File, not default.
  -o <outputFile>          Output File, only effected when compress/decompress, default is stdout.
  --compress-method        Compress method, one of pbc_only, pbc_fse, pbc_fsst, pbc_zstd, default is pbc_only.
  --pattern-size           The number of expected generate, default is 20.
  --train-data-number      The number of data used for training pattern, default is 500.
  --train-thread-num       The thread num used for training pattern, default is 16.
  --varchar                Data type of input file, only effected when train-pattern and test-compress, default is Record(split by '\n').

Examples:
  pbc --train-pattern -i inputFile -p patternFile --compress-method pbc_fsst --pattern-size 50 --train-data-number 1000 --train-thread-num 64 --varchar
  pbc --test-compress -i inputFile -p patternFile --compress-method pbc_fsst --varchar
  pbc --compress -i inputFile -p patternFile -o outputFile
  cat inputFile | pbc --compress -p patternFile > outputFile
  pbc --decompress -i inputFile -p patternFile -o outputFile
  cat inputFile | pbc --decompress -p patternFile > outputFile

License

Licensed under the Apache License, Version 2.0
