Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) #542
src/array/specification.rs
Outdated
});
const SIMD_CHUNK_SIZE: usize = 64;

let all_ascii = values.is_ascii();
FYI @ritchie46 this implementation should be faster than writing a manual loop with `all` for larger buffers.
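For reference, the two approaches being compared look roughly like this. This is a minimal sketch, not the PR's code: the function names `all_ascii_manual` and `all_ascii_slice` are made up for illustration, and only `<[u8]>::is_ascii`, `u8::is_ascii` and `Iterator::all` are standard-library calls. Both return the same answer; the claim above is only about speed on larger buffers.

```rust
/// Hypothetical helper: per-byte check, i.e. a "manual loop with `all`".
fn all_ascii_manual(values: &[u8]) -> bool {
    values.iter().all(|b| b.is_ascii())
}

/// Hypothetical helper: whole-slice check via the standard library's
/// `<[u8]>::is_ascii`, as used in the diff above.
fn all_ascii_slice(values: &[u8]) -> bool {
    values.is_ascii()
}
```

Any measured difference between the two depends on buffer size and target, so it is worth benchmarking on the actual data.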
Codecov Report
@@ Coverage Diff @@
## main #542 +/- ##
==========================================
+ Coverage 81.26% 81.28% +0.02%
==========================================
Files 380 380
Lines 23382 23386 +4
==========================================
+ Hits 19001 19010 +9
+ Misses 4381 4376 -5
Looks great! Thanks a lot @Dandandan! Could you rebase on top of the latest main? I merged #543 just to safeguard us against edge cases ^_^
d9a24a5 to 9ba4b0f
Done!
Benchmarks on https://github.com/DataEngineeringLabs/parquet-benchmark updated. Solid improvements 👍 Thanks a lot @ritchie46 and @Dandandan. 2.5x faster than pyarrow. I also just noticed that pyarrow uses multi-threading by default, so I still want to make the comparison a bit more fair. :)
Nice, thanks @ritchie46 for the great idea. I should rerun some benchmarks with DataFusion/arrow2 soon :)
A bit inspired by a commit from @ritchie46 in polars.
We can use `is_ascii` for smaller strings, as it's much faster than the utf8 check, especially when applied on a full buffer. For (larger) utf8 there is not a real perf hit.
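The idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `src/array/specification.rs`: the function name `validate_strings` and the `usize` offsets are hypothetical, and the offsets are assumed to be non-decreasing and within the bounds of `values`. If the whole buffer is ASCII, every byte is a complete character, so every slice between offsets is valid UTF-8 and the per-string check can be skipped.

```rust
use std::str::Utf8Error;

/// Hypothetical helper: checks that every string delimited by `offsets`
/// in `values` is valid UTF-8.
///
/// Assumption: `offsets` is non-decreasing and its last value is at most
/// `values.len()`; otherwise the slicing in the slow path panics.
fn validate_strings(offsets: &[usize], values: &[u8]) -> Result<(), Utf8Error> {
    // Fast path: one pass over the full buffer. All-ASCII bytes are always
    // valid UTF-8, and any offset falls on a character boundary.
    if values.is_ascii() {
        return Ok(());
    }
    // Slow path: validate each string slice individually.
    for w in offsets.windows(2) {
        std::str::from_utf8(&values[w[0]..w[1]])?;
    }
    Ok(())
}
```

For non-ASCII data the extra `is_ascii` call typically stops at the first non-ASCII chunk, which would be consistent with the observation that larger utf8 inputs see no real perf hit.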
Results on the benchmarks: