Description
The ORC format is a columnar format that enables faster, more efficient data access patterns such as column selection and indexing. We used it within the Timber ingestion pipeline and saw roughly a 90% performance improvement over plain text or CSV files. Our tests, as well as other benchmarks, show ORC performing slightly better than Parquet.
Best Practices
There are a few best practices for this format in the context of logs that were rigorously tested as part of the Timber pipeline development:
- A good default for the index step size is 10,000 records.
- Compressing ORC files with LZ4 (or gzip) yields significant performance and size improvements, since S3 data must be transferred over a network before it can be processed.
- While timestamp-sorted data takes better advantage of ORC's indexes, we found sorting unnecessary for the logging use case: log data typically arrives roughly in order, and strict ordering is not required to build useful indexes.
- Don't worry about bloom filters for this first version.
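The best practices above can be captured as writer defaults. The sketch below is hypothetical Rust (no such crate or API exists yet, as noted under Implementation); the type and field names are illustrative only.

```rust
// Hypothetical configuration for an ORC sink, reflecting the best
// practices above. These names do not correspond to any existing API.

#[derive(Debug, Clone, PartialEq)]
enum Compression {
    Lz4,
    Gzip,
    None,
}

#[derive(Debug, Clone)]
struct OrcWriterOptions {
    /// Row index step size: one index entry per this many records.
    index_stride: usize,
    /// Stream compression; LZ4 trades well against network transfer cost.
    compression: Compression,
    /// Bloom filters are deliberately omitted in a first version.
    bloom_filters: bool,
}

impl Default for OrcWriterOptions {
    fn default() -> Self {
        OrcWriterOptions {
            index_stride: 10_000,
            compression: Compression::Lz4,
            bloom_filters: false,
        }
    }
}

fn main() {
    let opts = OrcWriterOptions::default();
    println!("{:?}", opts);
}
```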
Implementation
The implementation for this feature is going to be interesting. I couldn't find a Rust crate for writing ORC data, and the official ORC library is written in Java, so I think we have three options:
- Attempt to follow the ORC spec and write a rudimentary writer that supports only the requirements above (ignoring bloom filters and other intricacies we don't need).
- Contract someone to do the above.
- Pass on supporting this feature.
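To give a feel for the scope of option 1, here is a bookkeeping-only skeleton of a rudimentary writer: it tracks where row-index entries and stripe boundaries would fall, using the 10,000-record index stride from above. All names are hypothetical, and a real writer would additionally encode column streams, compress them, and emit a protobuf footer and postscript per the ORC spec.

```rust
// Skeleton for a minimal spec-following ORC writer (option 1).
// Only the stride/stripe bookkeeping is shown; encoding is omitted.

struct RudimentaryOrcWriter {
    index_stride: usize,       // one row-index entry per this many rows
    stripe_target_rows: usize, // flush a stripe once it reaches this size
    rows_in_stripe: usize,
    index_entries: usize,
    stripes_flushed: usize,
}

impl RudimentaryOrcWriter {
    fn new(index_stride: usize, stripe_target_rows: usize) -> Self {
        RudimentaryOrcWriter {
            index_stride,
            stripe_target_rows,
            rows_in_stripe: 0,
            index_entries: 0,
            stripes_flushed: 0,
        }
    }

    /// Accept one record; note where a row-index entry and a stripe
    /// boundary would be written.
    fn write_row(&mut self) {
        if self.rows_in_stripe % self.index_stride == 0 {
            // A real writer records per-column stream offsets here.
            self.index_entries += 1;
        }
        self.rows_in_stripe += 1;
        if self.rows_in_stripe == self.stripe_target_rows {
            self.flush_stripe();
        }
    }

    fn flush_stripe(&mut self) {
        // A real writer compresses and writes the stripe's streams here.
        self.stripes_flushed += 1;
        self.rows_in_stripe = 0;
    }
}

fn main() {
    let mut w = RudimentaryOrcWriter::new(10_000, 50_000);
    for _ in 0..120_000 {
        w.write_row();
    }
    println!(
        "stripes={}, pending_rows={}, index_entries={}",
        w.stripes_flushed, w.rows_in_stripe, w.index_entries
    );
}
```

Even this stripped-down shape shows why skipping bloom filters keeps the first version tractable: the stride/stripe bookkeeping is simple, and the real work is in the column-stream encoding.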
I will say that recreating this in Rust would make for an interesting performance comparison, given that ORC is used within high-volume logging pipelines.