Benchmark of Arrow.jl vs Pyarrow (/Polars) #393
Wow! Thanks for the detailed research/investigation/writeup @svilupp! Most of what you mentioned sounds like things we should indeed do. I'm currently bogged down in a few other projects for the next month or two, but I'm hoping to then have some time to contribute more meaningfully to the package, as the list of issues has grown faster than I've been able to keep up with it. I appreciate all the effort you've put in here and would more than welcome PRs to improve things. I usually have time to review things regardless of what else is going on. Looking forward to improving things!
Oops, misclick.
I've already implemented most of the changes locally. I'll post some benchmarks and learnings here tomorrow, and open the relevant PRs if there is interest.
I'm certainly interested! Thanks for this hard work, @svilupp!
TL;DR: The world makes sense again! Arrow.jl is now the fastest reader (except for one case). Getting there took leveraging threads, skipping unnecessary resizing of buffers and some initialization, and adding support for InlineStrings (stack-allocated strings). Details and the implementation used for testing are in here. Here are some learnings for those of you seeking Arrow.jl performance (a minimal sketch of the user-facing ones follows below):
The rest is probably not suitable for most users, as it involves changing the package internals:
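A minimal sketch of the user-facing tips, assuming Arrow.jl and InlineStrings.jl are installed; the file and column names are hypothetical:

```julia
# Start Julia with threads (e.g. `julia -t auto`) so Arrow.jl can use them.
using Arrow
using InlineStrings

tbl = Arrow.Table("data.arrow")   # hypothetical file name

# Short strings (up to 255 bytes) can be converted to stack-allocated
# InlineStrings to avoid per-string heap allocations downstream;
# `inlinestrings` picks the smallest fitting InlineString type.
ids = inlinestrings(tbl.id)       # hypothetical column of short strings
```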
First of all, thank you for this amazing package!
Recently, I've been loading a lot of large files and it felt like Arrow.jl loading times were greater than in Python. I wanted to quantify this feeling, so I hacked up a rough benchmark (code + results below).
Observations
(Polars benchmarks are run with `use_pyarrow=True`, which improves the benchmarks quite a lot!)
Proposals
Benchmarking results
Machine:
Task 1: count the nonmissing elements in the first column of a table, repeated 10x
Data: 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
Data: 32 partitions (!), 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
(Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
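For reference, a minimal sketch of what Task 1 measures (not the actual benchmark.jl linked below; the file name is hypothetical):

```julia
using Arrow, Tables

# Read the file and count nonmissing values in the first column.
function task1(file)
    tbl = Arrow.Table(file)
    col = Tables.getcolumn(tbl, 1)
    return count(!ismissing, col)
end

# Repeat 10x, as in the task description above.
for _ in 1:10
    @time task1("strings_2cols.arrow")   # hypothetical file name
end
```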
Task 2: compute the mean of the first column (Int64) of a table, repeated 10x
Data: 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
Data: 32 partitions (!), 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)
(Arrow.jl timing also benefits from a quick fix to TranscodingStreams)
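Similarly, a minimal sketch of what Task 2 measures (again not the actual script; the file name is hypothetical):

```julia
using Arrow, Tables, Statistics

# Read the file and take the mean of the first (Int64) column.
function task2(file)
    tbl = Arrow.Table(file)
    return mean(Tables.getcolumn(tbl, 1))
end

for _ in 1:10
    @time task2("ints_10cols.arrow")   # hypothetical file name
end
```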
Benchmark details
benchmark.jl
benchmark.py (all 32 threads active, 1 partition/RecordBatch in the Arrow file)
benchmark_partitioned.py (all 32 threads active, 32 partitions/RecordBatches in the Arrow files)
benchmark_setup.py
benchmark_setup.jl
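For orientation, a minimal sketch of the kind of data generation a setup script like benchmark_setup.jl performs for Task 1, assumed from the data descriptions above (file names are hypothetical):

```julia
using Arrow, Tables, Random

# 2 columns of 5K-long strings, 10% missing, 10K rows (per the Task 1 description).
n = 10_000
strcol(n) = [rand() < 0.1 ? missing : randstring(5_000) for _ in 1:n]
tbl = (; col1 = strcol(n), col2 = strcol(n))

Arrow.write("strings.arrow", tbl)                       # uncompressed
Arrow.write("strings_lz4.arrow", tbl; compress=:lz4)    # LZ4 frame compression
Arrow.write("strings_zstd.arrow", tbl; compress=:zstd)  # ZSTD compression

# 32 partitions/RecordBatches: write a partitioned source.
parts = Tables.partitioner((; col1 = strcol(n ÷ 32), col2 = strcol(n ÷ 32)) for _ in 1:32)
Arrow.write("strings_32parts.arrow", parts)
```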