This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) #542

Merged (7 commits) on Oct 19, 2021

Dandandan
Collaborator

A bit inspired by a commit from @ritchie46 in polars.

We can use is_ascii as a fast path, which helps most for smaller strings: it is much faster than the UTF-8 check, especially when applied to the full buffer at once.
For (larger) UTF-8 strings there is no real performance hit.
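
The idea above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `validate_utf8_slots`, the `usize` offsets, and the `String` error type are assumptions made for the example, and it uses `std::str::from_utf8` for the slow path rather than whatever validator the library actually calls.

```rust
/// Hypothetical helper (not the arrow2 API): checks that every slot
/// delimited by consecutive `offsets` in `values` is valid UTF-8.
fn validate_utf8_slots(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    // Offsets must be non-decreasing and stay inside the buffer.
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets must be monotonically increasing".to_string());
    }
    if offsets.last().map_or(false, |&last| last > values.len()) {
        return Err("offsets must not exceed the values length".to_string());
    }

    // Fast path: an all-ASCII buffer is valid UTF-8 no matter where the
    // offsets split it, so one pass over the whole buffer is enough.
    if values.is_ascii() {
        return Ok(());
    }

    // Slow path: validate each slot individually.
    for w in offsets.windows(2) {
        std::str::from_utf8(&values[w[0]..w[1]])
            .map_err(|e| format!("invalid utf8 in slot starting at {}: {}", w[0], e))?;
    }
    Ok(())
}
```

The fast path only ever skips work: a buffer containing any non-ASCII byte falls through to the per-slot check, so the result is the same as validating every slot.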

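Results on the benchmarks (first pair of columns: with this change; second pair: before the change; each pair is the relative factor followed by the measured time):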

read utf8 2^10                 1.00     33.9±0.65µs    1.16     39.4±1.88µs
read utf8 2^12                 1.00     48.4±0.37µs    1.38     66.9±2.05µs
read utf8 2^14                 1.00    101.7±2.47µs    1.57    159.2±1.98µs
read utf8 2^16                 1.00    293.3±2.35µs    1.80   528.5±20.57µs
read utf8 2^18                 1.00   1056.6±9.58µs    1.89      2.0±0.02ms
read utf8 2^20                 1.00      4.4±0.02ms    1.81      8.0±0.12ms
read utf8 dict 2^10            1.00     36.7±1.12µs    1.14     41.9±1.10µs
read utf8 dict 2^12            1.00     55.9±1.40µs    1.27     71.2±1.87µs
read utf8 dict 2^14            1.00    123.2±0.77µs    1.51    186.4±2.38µs
read utf8 dict 2^16            1.00    360.8±3.16µs    1.74    627.9±7.52µs
read utf8 dict 2^18            1.00  1340.1±10.74µs    1.74      2.3±0.03ms
read utf8 dict 2^20            1.00      5.4±0.07ms    1.71      9.2±0.10ms
read utf8 large 2^10           1.00     54.0±1.15µs    1.10     59.3±1.44µs
read utf8 large 2^12           1.00    131.0±1.93µs    1.05    137.1±4.08µs
read utf8 large 2^14           1.00   911.1±19.27µs    1.05   959.5±14.10µs
read utf8 large 2^16           1.01      3.9±0.03ms    1.00      3.9±0.10ms
read utf8 large 2^18           1.00     17.1±0.16ms    1.06     18.1±0.40ms
read utf8 large 2^20           1.00     69.8±0.54ms    1.11     77.4±1.07ms
read utf8 multi 2^10           1.00     37.1±0.31µs    1.12     41.6±0.81µs
read utf8 multi 2^12           1.00     54.2±0.42µs    1.39     75.6±5.97µs
read utf8 multi 2^14           1.00    110.9±1.12µs    1.55    171.7±2.59µs
read utf8 multi 2^16           1.00   358.4±11.54µs    1.63    582.9±7.18µs
read utf8 multi 2^18           1.00   1258.4±7.18µs    1.83      2.3±0.04ms
read utf8 multi 2^20           1.00      5.1±0.05ms    1.68      8.6±0.10ms
read utf8 multi snappy 2^10    1.00     33.9±0.82µs    1.18     39.8±0.81µs
read utf8 multi snappy 2^12    1.00     52.3±0.44µs    1.30     68.0±1.97µs
read utf8 multi snappy 2^14    1.00    124.9±4.95µs    1.53    190.9±5.59µs
read utf8 multi snappy 2^16    1.00    385.8±7.92µs    1.68    646.7±6.01µs
read utf8 multi snappy 2^18    1.00   1429.2±6.74µs    1.73      2.5±0.22ms
read utf8 multi snappy 2^20    1.00      6.0±0.05ms    1.60      9.6±0.33ms
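
As a quick sanity check on how to read the table, the relative factor appears to be each time divided by the fastest time in the same row. The sketch below reproduces the factor for the `read utf8 2^16` row under that assumption; the function name `relative_factor` is made up for the example.

```rust
/// Hypothetical helper: ratio of a measured time to the fastest time
/// in the same benchmark row (both in the same unit).
fn relative_factor(time_us: f64, fastest_us: f64) -> f64 {
    time_us / fastest_us
}

/// The `read utf8 2^16` row: 293.3µs with the change, 528.5µs before.
fn factor_for_2_16_row() -> f64 {
    relative_factor(528.5, 293.3)
}
```

Rounded to two decimals this gives the 1.80 shown in that row.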

@Dandandan Dandandan changed the title from "Add fast path for validating ASCII text (~1.14-1.89x improvement on ASCII strings)" to "Add fast path for validating ASCII text (~1.14-1.89x improvement on reading ASCII parquet data)" Oct 19, 2021
});
const SIMD_CHUNK_SIZE: usize = 64;

let all_ascii = values.is_ascii();
Collaborator Author

FYI @ritchie46, for larger buffers this implementation should be faster than writing a manual loop with `all`.
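
To make the comparison concrete, here are the two variants side by side. Both function names are made up for this sketch. The two functions return the same answer; the difference is only in speed, since the standard library's slice `is_ascii` is, as far as I know, written to examine several bytes per step rather than one at a time.

```rust
/// Byte-at-a-time check: the "manual loop with `all`" variant.
fn is_ascii_bytewise(values: &[u8]) -> bool {
    values.iter().all(|b| b.is_ascii())
}

/// Whole-buffer check that delegates to the standard library's
/// `is_ascii` on the slice, as in the line under review.
fn is_ascii_whole_buffer(values: &[u8]) -> bool {
    values.is_ascii()
}
```

Which one wins for a given buffer size is best settled by benchmarking, as done for this PR.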

@codecov

codecov bot commented Oct 19, 2021

Codecov Report

Merging #542 (9ba4b0f) into main (002a0ef) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
+ Coverage   81.26%   81.28%   +0.02%     
==========================================
  Files         380      380              
  Lines       23382    23386       +4     
==========================================
+ Hits        19001    19010       +9     
+ Misses       4381     4376       -5     
Impacted Files                        Coverage Δ
src/array/specification.rs            83.33% <100.00%> (+2.08%) ⬆️
src/io/ipc/write/common.rs            76.28% <0.00%> (+0.64%) ⬆️
src/bitmap/utils/slice_iterator.rs    94.02% <0.00%> (+1.49%) ⬆️
src/compute/arithmetics/time.rs       46.93% <0.00%> (+2.04%) ⬆️
src/io/avro/read/schema.rs            55.95% <0.00%> (+2.38%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Dandandan Dandandan changed the title from "Add fast path for validating ASCII text (~1.14-1.89x improvement on reading ASCII parquet data)" to "Add fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data)" Oct 19, 2021
@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Oct 19, 2021
@jorgecarleitao
Owner

Looks great! Thanks a lot @Dandandan! Could you rebase on top of the latest main? I merged #543 just to safeguard us against edge cases ^_^

@Dandandan
Collaborator Author

> Looks great! Thanks a lot @Dandandan! Could you rebase on top of the latest main? I merged #543 just to safeguard us against edge cases ^_^

Done!

@jorgecarleitao jorgecarleitao changed the title from "Add fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data)" to "Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data)" Oct 19, 2021
@jorgecarleitao jorgecarleitao merged commit b62184d into jorgecarleitao:main Oct 19, 2021
@jorgecarleitao
Owner

Benchmarks on https://github.com/DataEngineeringLabs/parquet-benchmark are updated. Solid improvements 👍 Thanks a lot @ritchie46 and @Dandandan. 2.5x faster than pyarrow.

I also just noticed that pyarrow uses multi-threading by default, so I still want to make the comparison a bit more fair. :)

@Dandandan
Collaborator Author

> Benchmarks on https://github.com/DataEngineeringLabs/parquet-benchmark are updated. Solid improvements 👍 Thanks a lot @ritchie46 and @Dandandan. 2.5x faster than pyarrow.
>
> I also just noticed that pyarrow uses multi-threading by default, so I still want to make the comparison a bit more fair. :)

Nice, thanks @ritchie46 for the great idea.

I should rerun some benchmarks with DataFusion/arrow2 soon :)
