Archive Compression Considerations #2

Closed
metachris opened this issue Aug 8, 2023 · 1 comment

metachris commented Aug 8, 2023

Input data

  • 14 hourly CSV files (raw transactions + timestamp + hash)
  • Transactions: 1,514,668
  • Disk usage: 1.7G
filename                            entries      size
txs-2023-08-07-10-00.csv             15,965       19M
txs-2023-08-07-11-00.csv            106,435      144M
txs-2023-08-07-12-00.csv            117,599      131M
txs-2023-08-07-13-00.csv            117,184      143M
txs-2023-08-07-14-00.csv            126,056      121M
txs-2023-08-07-15-00.csv            125,871      131M
txs-2023-08-07-16-00.csv            124,732      135M
txs-2023-08-07-17-00.csv            122,725      133M
txs-2023-08-07-18-00.csv            117,119      126M
txs-2023-08-07-19-00.csv            113,833      127M
txs-2023-08-07-20-00.csv            109,858      125M
txs-2023-08-07-21-00.csv            105,749      121M
txs-2023-08-07-22-00.csv            112,109      114M
txs-2023-08-07-23-00.csv             99,433      101M
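
A table like the one above could be produced with a short script. This is only a sketch: the function name `csv_stats` is hypothetical, and it assumes each entry occupies exactly one line and that the files have no header row, neither of which the issue states.

```python
from pathlib import Path


def csv_stats(directory: str) -> list[tuple[str, int, int]]:
    """Return (filename, entries, size_in_bytes) for every CSV, sorted by name."""
    rows = []
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open("rb") as f:
            # Assumption: one transaction per line, no header row.
            entries = sum(1 for _ in f)
        rows.append((path.name, entries, path.stat().st_size))
    return rows


def print_stats(directory: str) -> None:
    """Print the per-file table plus a total entry count."""
    rows = csv_stats(directory)
    for name, entries, size in rows:
        print(f"{name:<32}{entries:>12,}{size / 2**20:>9.0f}M")
    print(f"total entries: {sum(entries for _, entries, _ in rows):,}")
```

Summing the `entries` column of the table gives 1,514,668, which matches the transaction count listed under "Input data".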

Compression

Method   Level   Size   Ratio   Runtime
lz4          9   841M    0.49       38s
lz4         12   840M    0.49    1m 55s
zip          6   644M    0.38       45s
zip          9   640M    0.38    1m 23s
zstd         3   580M    0.34        9s
zstd        14   578M    0.34    2m 45s
zstd        15   577M    0.34    3m 47s
zstd        16   524M    0.31    4m 45s
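
The Ratio column appears to be compressed size divided by input size (for example, 644M / 1.7G ≈ 0.38). The sketch below shows how one row of such a table might be measured. It is not the benchmark used in this issue: it uses only Python's standard-library `zlib` (Deflate, the algorithm behind zip), runs on synthetic in-memory data, and the function name `benchmark_deflate` is hypothetical. lz4 and zstd would need third-party bindings and are left out.

```python
import time
import zlib


def benchmark_deflate(data: bytes, level: int) -> dict:
    """Compress `data` with Deflate at `level`; report size, ratio and runtime."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return {
        "level": level,
        "size": len(compressed),
        # Same convention as the table: compressed size / original size.
        "ratio": len(compressed) / len(data),
        "runtime_s": elapsed,
    }


if __name__ == "__main__":
    # Synthetic stand-in for rows of "timestamp,hash,raw transaction".
    sample = "".join(
        f"{1691402400000 + i},0x{i * 7919:064x},0x{i * 104729:0128x}\n"
        for i in range(20_000)
    ).encode()
    for level in (6, 9):
        result = benchmark_deflate(sample, level)
        print(f"deflate {level}: ratio {result['ratio']:.2f}, "
              f"{result['runtime_s']:.2f}s")
```

Numbers from synthetic data like this will not match the table, since real transaction payloads compress differently.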

Summarizer Script

  • Runtime: 1m 12s
  • Parquet output size: 74M (using gzip compression)

metachris commented

Going with zip for now: its compression is in the same ballpark as default zstd on this type of data, and it is generally available (any user can download the archive and extract it without installing further software).
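
As an illustration of that choice, here is a minimal sketch of packing the hourly CSVs into a zip archive with Python's standard-library `zipfile`. The function name `archive_csvs`, the directory layout, and the default level of 6 are assumptions for the example, not the project's actual archiving code.

```python
import zipfile
from pathlib import Path


def archive_csvs(src_dir: str, archive_path: str, level: int = 6) -> list[str]:
    """Pack every *.csv in `src_dir` into one Deflate-compressed zip archive."""
    names = []
    with zipfile.ZipFile(
        archive_path,
        "w",
        compression=zipfile.ZIP_DEFLATED,
        compresslevel=level,  # 6 and 9 are the zip levels compared above
    ) as zf:
        for csv_path in sorted(Path(src_dir).glob("*.csv")):
            # Store only the file name so the archive extracts flat.
            zf.write(csv_path, arcname=csv_path.name)
            names.append(csv_path.name)
    return names
```

The resulting file can be opened with the extraction tools built into most operating systems, which is the availability argument made above.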
