Description
The ORC format is a columnar format that enables faster, more efficient data access patterns such as column selection and indexing. We used it within the Timber ingestion pipeline and saw roughly a 90% performance improvement over plain text or CSV files. Our tests, as well as other benchmarks, show ORC performing slightly better than Parquet.
Best Practices
There are a few best practices for this format in the context of logs that were rigorously tested as part of the Timber pipeline development:
- A good default for the index step size is 10,000 records.
- Compressing ORC files with LZ4 (or gzip) yields significant performance and size improvements, since S3 data must be transferred over a network before it can be processed.
- While timestamp-sorted data takes better advantage of ORC's indexes, we found sorting unnecessary for the logging use case: log data typically arrives roughly in order, and strict ordering is not required to build useful indexes.
- Don't worry about bloom filters for this first version.
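The best practices above can be captured as writer defaults. The sketch below is hypothetical Rust (no such crate or API exists yet, as noted under Implementation); the type and field names are illustrative only.

```rust
// Hypothetical configuration for an ORC sink, reflecting the best
// practices above. These names do not correspond to any existing API.

#[derive(Debug, Clone, PartialEq)]
enum Compression {
    Lz4,
    Gzip,
    None,
}

#[derive(Debug, Clone)]
struct OrcWriterOptions {
    /// Row index step size: one index entry per this many records.
    index_stride: usize,
    /// Stream compression; LZ4 trades well against network transfer cost.
    compression: Compression,
    /// Bloom filters are deliberately omitted in a first version.
    bloom_filters: bool,
}

impl Default for OrcWriterOptions {
    fn default() -> Self {
        OrcWriterOptions {
            index_stride: 10_000,
            compression: Compression::Lz4,
            bloom_filters: false,
        }
    }
}

fn main() {
    let opts = OrcWriterOptions::default();
    println!("{:?}", opts);
}
```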
Implementation
The implementation for this feature is going to be interesting. I couldn't find a Rust crate for writing ORC data, and the official ORC library is written in Java, so I think we have three options:
- Attempt to follow the ORC spec and write a rudimentary writer that supports only the requirements above (ignoring bloom filters and other intricacies we don't need).
- Contract someone to do the above.
- Pass on supporting this feature.
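To give a feel for the scope of option 1, here is a bookkeeping-only skeleton of a rudimentary writer: it tracks where row-index entries and stripe boundaries would fall, using the 10,000-record index stride from above. All names are hypothetical, and a real writer would additionally encode column streams, compress them, and emit a protobuf footer and postscript per the ORC spec.

```rust
// Skeleton for a minimal spec-following ORC writer (option 1).
// Only the stride/stripe bookkeeping is shown; encoding is omitted.

struct RudimentaryOrcWriter {
    index_stride: usize,       // one row-index entry per this many rows
    stripe_target_rows: usize, // flush a stripe once it reaches this size
    rows_in_stripe: usize,
    index_entries: usize,
    stripes_flushed: usize,
}

impl RudimentaryOrcWriter {
    fn new(index_stride: usize, stripe_target_rows: usize) -> Self {
        RudimentaryOrcWriter {
            index_stride,
            stripe_target_rows,
            rows_in_stripe: 0,
            index_entries: 0,
            stripes_flushed: 0,
        }
    }

    /// Accept one record; note where a row-index entry and a stripe
    /// boundary would be written.
    fn write_row(&mut self) {
        if self.rows_in_stripe % self.index_stride == 0 {
            // A real writer records per-column stream offsets here.
            self.index_entries += 1;
        }
        self.rows_in_stripe += 1;
        if self.rows_in_stripe == self.stripe_target_rows {
            self.flush_stripe();
        }
    }

    fn flush_stripe(&mut self) {
        // A real writer compresses and writes the stripe's streams here.
        self.stripes_flushed += 1;
        self.rows_in_stripe = 0;
    }
}

fn main() {
    let mut w = RudimentaryOrcWriter::new(10_000, 50_000);
    for _ in 0..120_000 {
        w.write_row();
    }
    println!(
        "stripes={}, pending_rows={}, index_entries={}",
        w.stripes_flushed, w.rows_in_stripe, w.index_entries
    );
}
```

Even this stripped-down shape shows why skipping bloom filters keeps the first version tractable: the stride/stripe bookkeeping is simple, and the real work is in the column-stream encoding.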
I will say that recreating this in Rust would make for an interesting performance comparison, given that ORC is used within high-volume logging pipelines.