PBC (Pattern-Based Compression) is a fast lossless compression algorithm that specifically targets patterns in machine-generated data and achieves Pareto-optimality in most cases. Unlike traditional block-based methods, PBC compresses data on a per-record basis, which enables fast random access. The technical details are described in the paper "High-Ratio Compression for Machine-Generated Data", accepted at SIGMOD 2024 and available on arXiv: https://arxiv.org/pdf/2311.13947.pdf
For reference, several fast compression algorithms were tested and compared on a machine with an Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz and 755GB of main memory, running Linux kernel version 4.19.91. The Android and Apache datasets are logs, and the urls dataset contains a collection of URLs. Only a subset of our experiments is shown here; for the comprehensive benchmarks and a detailed analysis of the results, please refer to the paper "High-Ratio Compression for Machine-Generated Data".
Dataset Name | Compressor Name | Compression Ratio | Compression Speed (MB/s) | Decompression Speed (MB/s) |
---|---|---|---|---|
Android | FSST | 0.576 | 261.78 | 2096.87 |
Android | LZ4 | 0.560 | 30.92 | 1282.84 |
Android | Zstd | 0.543 | 18.62 | 284.01 |
Android | PBC | 0.347 | 60.40 | 3231.56 |
Android | PBC_FSST | 0.245 | 53.13 | 1580.05 |
Apache | FSST | 0.322 | 320.72 | 3039.89 |
Apache | LZ4 | 0.349 | 31.31 | 1773.38 |
Apache | Zstd | 0.411 | 12.07 | 343.56 |
Apache | PBC | 0.151 | 48.85 | 3140.39 |
Apache | PBC_FSST | 0.104 | 43.32 | 1909.66 |
urls | FSST | 0.413 | 195.89 | 1807.98 |
urls | LZ4 | 0.456 | 22.15 | 1247.63 |
urls | Zstd | 0.611 | 11.35 | 158.91 |
urls | PBC | 0.299 | 63.67 | 2029.16 |
urls | PBC_FSST | 0.248 | 55.11 | 1043.43 |
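In the table, compression ratio is reported as compressed size divided by original size, so lower is better; for example, the 0.104 ratio of PBC_FSST on the Apache dataset means the compressed output is about 10.4% of the original, roughly a 9.6x size reduction.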
PBC can use another compression encoder to further compress data that has already been compressed by PBC. Currently supported secondary encoders are FSE, FSST, and Zstd. Depending on the encoder used, the variants are referred to as PBC_ONLY (PBC alone), PBC_FSE, PBC_FSST, and PBC_ZSTD, respectively.
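For example, on the Apache dataset above, PBC alone reaches a ratio of 0.151, while chaining FSST after PBC (PBC_FSST) improves it to 0.104. To pick a different secondary encoder with the terminal tool described below, pass the corresponding --compress-method value when training the pattern, e.g. (paths are illustrative):
./bin/pbc --train-pattern -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_zstd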
Here is a quick start on how to use PBC. You can also refer to the code in the example directory.
Requirements:
We recommend using Ubuntu 22.04 as the base environment. You can pull the image using:
docker pull ubuntu:22.04
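One way to start an interactive container, with the PBC repository mounted from the host (the host path below is only an example), is:
docker run -it -v /path/to/pbc:/pbc ubuntu:22.04 /bin/bash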
Once inside the container, run the following commands to install the necessary dependencies:
apt-get update
apt install -y build-essential cmake clang lldb llvm-dev lld libboost-all-dev
These packages are essential for building the project.
Then, please follow these steps to build the project:
- Build and install PBC
cd pbc
bash ./build.sh -r # build pbc; the default is a Debug build, and -r builds the Release version
bash ./run_tests.sh # run pbc tests
./install_pbc.sh # install the pbc libraries and header files
- Build with pbc library
cd example
clang++ pbc_train_pattern.cc -L/usr/local/lib/pbc -I/usr/local/include -lpbc -lpbc_fse -lpbc_fsst -lzstd -lhs -lpthread -o pbc_train_pattern
clang++ pbc_compress.cc -L/usr/local/lib/pbc -I/usr/local/include -lpbc -lpbc_fse -lpbc_fsst -lzstd -lhs -lpthread -o pbc_compress
Example code (example/pbc_train_pattern.cc, example/pbc_compress.cc) is provided.
- Run example
cp ../dataset/Apache Apache
./pbc_train_pattern Apache Apache.pat
./pbc_compress Apache Apache.pat
Please run all commands with sudo.
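Note: if PBC is installed as shared libraries under /usr/local/lib/pbc (the directory assumed by the -L flag above), the dynamic loader may not find them when you run the example binaries. In that case, one option is to extend LD_LIBRARY_PATH first:
export LD_LIBRARY_PATH=/usr/local/lib/pbc:$LD_LIBRARY_PATH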
The terminal tool supports four operations: --train-pattern, --test-compress, --compress, and --decompress. A complete round-trip example combining them is shown after the list of operations below.
- Train the pattern
./bin/pbc --train-pattern -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_fsst --pattern-size 50 --train-data-number 1000 --train-thread-num 64
- Test Compress Ratio
./bin/pbc --test-compress -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_fsst
- File Compress
./bin/pbc --compress -i dataset/Apache -p dataset/Apache.pat -o dataset/Apache.compress
- File Decompress
./bin/pbc --decompress -i dataset/Apache.compress -p dataset/Apache.pat -o dataset/Apache.origin
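Putting the four operations together, a simple lossless round-trip check (using only the commands shown above plus the standard cmp utility) looks like this:
./bin/pbc --train-pattern -i dataset/Apache -p dataset/Apache.pat --compress-method pbc_fsst
./bin/pbc --compress -i dataset/Apache -p dataset/Apache.pat -o dataset/Apache.compress
./bin/pbc --decompress -i dataset/Apache.compress -p dataset/Apache.pat -o dataset/Apache.origin
cmp dataset/Apache dataset/Apache.origin && echo "round-trip OK"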
Usage: pbc [OPTIONS] [arg [arg ...]]
--help Output this help and exit.
--train-pattern -i <inputFile> -p <patternFile> [--compress-method <pbc_only/pbc_fse/pbc_fsst/pbc_zstd>] [--pattern-size <pattern_size>] [--train-data-number <train_data_number>] [--train-thread-num <train_thread_num>] [--varchar].
--test-compress -i <inputFile> -p <patternFile> [--compress-method <pbc_only/pbc_fse/pbc_fsst/pbc_zstd>] [--varchar].
-c/--compress -i <inputFile> -p <patternFile> [-o <outputFile>].
-d/--decompress -i <inputFile> -p <patternFile> [-o <outputFile>].
-i <inputFile> Input file. Required for train-pattern/test-compress (no default); for compress/decompress the default is stdin.
-p <patternFile> Pattern file. Required (no default).
-o <outputFile> Output file. Only effective for compress/decompress; the default is stdout.
--compress-method Compression method, one of pbc_only, pbc_fse, pbc_fsst, pbc_zstd; the default is pbc_only.
--pattern-size The expected number of patterns to generate; the default is 20.
--train-data-number The number of records used for pattern training; the default is 500.
--train-thread-num The number of threads used for pattern training; the default is 16.
--varchar Data type of the input file; only effective for train-pattern and test-compress. The default is Record (records split by '\n').
Examples:
pbc --train-pattern -i inputFile -p patternFile --compress-method pbc_fsst --pattern-size 50 --train-data-number 1000 --train-thread-num 64 --varchar
pbc --test-compress -i inputFile -p patternFile --compress-method pbc_fsst --varchar
pbc --compress -i inputFile -p patternFile -o outputFile
cat inputFile | pbc --compress -p patternFile > outputFile
pbc --decompress -i inputFile -p patternFile -o outputFile
cat inputFile | pbc --decompress -p patternFile > outputFile
Licensed under the Apache License, Version 2.0