Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) #542

Dandandan · 2021-10-19T18:33:27Z

A bit inspired by a commit from @ritchie46 in polars.

We can use `is_ascii` as a fast path for smaller strings: it is much faster than the UTF-8 check, especially when applied to the full buffer at once.
For (larger) UTF-8 strings there is no real performance hit.
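
For readers who want the idea spelled out, here is a minimal sketch of an ASCII fast path in front of per-string UTF-8 validation. It is not the code merged in this PR: the names `check_offsets_sketch` and `validate_utf8_column` are hypothetical, offsets are plain `usize` values, and the slow path uses `std::str::from_utf8` from the standard library rather than whatever validator the crate actually calls.

```rust
use std::str;

/// Hypothetical helper: offsets must be non-decreasing and stay within `values`.
fn check_offsets_sketch(offsets: &[usize], values_len: usize) -> Result<(), String> {
    if offsets.windows(2).any(|w| w[0] > w[1]) {
        return Err("offsets must be non-decreasing".to_string());
    }
    match offsets.last() {
        Some(&last) if last > values_len => Err("offset out of bounds".to_string()),
        _ => Ok(()),
    }
}

/// Hypothetical validator for a string column stored as `offsets` + `values`.
fn validate_utf8_column(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    check_offsets_sketch(offsets, values.len())?;

    // Fast path: one pass over the whole buffer. Every ASCII byte is a
    // complete code point, so any slice of an ASCII buffer is valid UTF-8.
    if values.is_ascii() {
        return Ok(());
    }

    // Slow path: validate each string separately, so that an offset that
    // splits a multi-byte character is reported as an error.
    for w in offsets.windows(2) {
        str::from_utf8(&values[w[0]..w[1]]).map_err(|e| e.to_string())?;
    }
    Ok(())
}
```

The fast path only skips the per-string UTF-8 check; the offsets are still validated in both branches.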

Results on the benchmarks:

read utf8 2^10                 1.00     33.9±0.65µs    1.16     39.4±1.88µs
read utf8 2^12                 1.00     48.4±0.37µs    1.38     66.9±2.05µs
read utf8 2^14                 1.00    101.7±2.47µs    1.57    159.2±1.98µs
read utf8 2^16                 1.00    293.3±2.35µs    1.80   528.5±20.57µs
read utf8 2^18                 1.00   1056.6±9.58µs    1.89      2.0±0.02ms
read utf8 2^20                 1.00      4.4±0.02ms    1.81      8.0±0.12ms
read utf8 dict 2^10            1.00     36.7±1.12µs    1.14     41.9±1.10µs
read utf8 dict 2^12            1.00     55.9±1.40µs    1.27     71.2±1.87µs
read utf8 dict 2^14            1.00    123.2±0.77µs    1.51    186.4±2.38µs
read utf8 dict 2^16            1.00    360.8±3.16µs    1.74    627.9±7.52µs
read utf8 dict 2^18            1.00  1340.1±10.74µs    1.74      2.3±0.03ms
read utf8 dict 2^20            1.00      5.4±0.07ms    1.71      9.2±0.10ms
read utf8 large 2^10           1.00     54.0±1.15µs    1.10     59.3±1.44µs
read utf8 large 2^12           1.00    131.0±1.93µs    1.05    137.1±4.08µs
read utf8 large 2^14           1.00   911.1±19.27µs    1.05   959.5±14.10µs
read utf8 large 2^16           1.01      3.9±0.03ms    1.00      3.9±0.10ms
read utf8 large 2^18           1.00     17.1±0.16ms    1.06     18.1±0.40ms
read utf8 large 2^20           1.00     69.8±0.54ms    1.11     77.4±1.07ms
read utf8 multi 2^10           1.00     37.1±0.31µs    1.12     41.6±0.81µs
read utf8 multi 2^12           1.00     54.2±0.42µs    1.39     75.6±5.97µs
read utf8 multi 2^14           1.00    110.9±1.12µs    1.55    171.7±2.59µs
read utf8 multi 2^16           1.00   358.4±11.54µs    1.63    582.9±7.18µs
read utf8 multi 2^18           1.00   1258.4±7.18µs    1.83      2.3±0.04ms
read utf8 multi 2^20           1.00      5.1±0.05ms    1.68      8.6±0.10ms
read utf8 multi snappy 2^10    1.00     33.9±0.82µs    1.18     39.8±0.81µs
read utf8 multi snappy 2^12    1.00     52.3±0.44µs    1.30     68.0±1.97µs
read utf8 multi snappy 2^14    1.00    124.9±4.95µs    1.53    190.9±5.59µs
read utf8 multi snappy 2^16    1.00    385.8±7.92µs    1.68    646.7±6.01µs
read utf8 multi snappy 2^18    1.00   1429.2±6.74µs    1.73      2.5±0.22ms
read utf8 multi snappy 2^20    1.00      6.0±0.05ms    1.60      9.6±0.33ms

Dandandan · 2021-10-19T18:43:53Z

src/array/specification.rs

-    });
+    const SIMD_CHUNK_SIZE: usize = 64;
+
+    let all_ascii = values.is_ascii();


FYI @ritchie46, this implementation should be faster than writing a manual loop with `all` for larger buffers.
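
To make the comparison in this comment concrete, the two variants can be sketched as below. The function names are hypothetical and neither is taken from the PR. Both return the same answer; the claim above is only about speed, which this sketch does not measure.

```rust
/// Hypothetical "manual loop" variant: test one byte at a time via `Iterator::all`.
fn is_ascii_per_byte(values: &[u8]) -> bool {
    values.iter().all(|b| b.is_ascii())
}

/// Whole-buffer variant: a single call to the standard library's slice method,
/// which is free to process several bytes per step.
fn is_ascii_whole_buffer(values: &[u8]) -> bool {
    values.is_ascii()
}
```

Since the two are interchangeable in behaviour, the choice between them comes down to measured performance on the buffer sizes of interest.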

codecov · 2021-10-19T18:44:07Z

Codecov Report

Merging #542 (9ba4b0f) into main (002a0ef) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
+ Coverage   81.26%   81.28%   +0.02%     
==========================================
  Files         380      380              
  Lines       23382    23386       +4     
==========================================
+ Hits        19001    19010       +9     
+ Misses       4381     4376       -5

Impacted Files	Coverage Δ
src/array/specification.rs	`83.33% <100.00%> (+2.08%)`	⬆️
src/io/ipc/write/common.rs	`76.28% <0.00%> (+0.64%)`	⬆️
src/bitmap/utils/slice_iterator.rs	`94.02% <0.00%> (+1.49%)`	⬆️
src/compute/arithmetics/time.rs	`46.93% <0.00%> (+2.04%)`	⬆️
src/io/avro/read/schema.rs	`55.95% <0.00%> (+2.38%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao · 2021-10-19T19:34:08Z

Looks great! Thanks a lot @Dandandan ! Could you rebase on top of the latest main? I merged #543 just to safeguard us against edge cases ^_^

Dandandan · 2021-10-19T19:58:29Z

> Looks great! Thanks a lot @Dandandan ! Could you rebase on top of the latest main? I merged #543 just to safeguard us against edge cases ^_^

Done!

jorgecarleitao · 2021-10-19T20:38:35Z

benchmarks on https://github.com/DataEngineeringLabs/parquet-benchmark updated. Solid improvements 👍 Thanks a lot @ritchie46 and @Dandandan . 2.5x faster than pyarrow.

I also just noticed that pyarrow uses multi-threading by default, so I still want to make the comparison a bit more fair. :)

Dandandan · 2021-10-19T21:03:52Z

> benchmarks on https://github.com/DataEngineeringLabs/parquet-benchmark updated. Solid improvements 👍 Thanks a lot @ritchie46 and @Dandandan . 2.5x faster than pyarrow.
>
> I also just noticed that pyarrow uses multi-threading by default, so I still want to make the comparison a bit more fair. :)

Nice, thanks @ritchie46 for the great idea.

I should rerun some benchmarks with DataFusion/arrow2 soon :)

Dandandan changed the title ~~Add fast path for validating ASCII text (~1.14-1.89x improvement on ASCII strings)~~ Add fast path for validating ASCII text (~1.14-1.89x improvement on reading ASCII parquet data) Oct 19, 2021

jorgecarleitao mentioned this pull request Oct 19, 2021

Improved performance of utf8 check for ascii-only (-40% parquet reading ascii-only columns) #541

Closed

jorgecarleitao reviewed Oct 19, 2021

Dandandan commented Oct 19, 2021

Dandandan changed the title ~~Add fast path for validating ASCII text (~1.14-1.89x improvement on reading ASCII parquet data)~~ Add fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) Oct 19, 2021

jorgecarleitao mentioned this pull request Oct 19, 2021

Added more tests for utf8 #543

Merged

jorgecarleitao added the enhancement label (An improvement to an existing feature) Oct 19, 2021

Dandandan added 7 commits October 19, 2021 21:54

Add fast path for checking ascii text

facdc91

Move fast path before loop for ascii

29988ac

Small simplification

b70c734

Clippy

579c0d0

Move implementation

05f76e7

Inline condition

ec2fc31

Clippy

9ba4b0f

Dandandan force-pushed the ascii_parquet branch from d9a24a5 to 9ba4b0f Compare October 19, 2021 19:54

jorgecarleitao changed the title ~~Add fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data)~~ Added fast path for validating ASCII text (~1.12-1.89x improvement on reading ASCII parquet data) Oct 19, 2021

jorgecarleitao merged commit b62184d into jorgecarleitao:main Oct 19, 2021
