Buffer records in row format in memory for SortExec #2146

yjshen · 2022-04-04T04:27:09Z

Which issue does this PR close?

Closes #2151.

This PR is based on #2132.

Rationale for this change

Row format is more cache-friendly for "row logic" operators, at the cost of extra computation when it is incorporated into a vectorized engine like DataFusion (extra row <-> columnar conversions are required).

This work explores using the row format as the in-memory record buffer layout, for benchmarking, testing, and discussion purposes.
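
To make the conversion cost concrete, here is a minimal Python sketch of a columnar -> row -> columnar round trip. The fixed-width layout (`ROW`), the helper names, and the sample values are all assumptions made up for illustration; DataFusion's actual row format is not reproduced here.

```python
import struct

# Hypothetical fixed-width row layout: two 1-byte flags followed by four
# float64 payload values. This is NOT DataFusion's actual row format.
ROW = struct.Struct("<cc4d")


def columns_to_rows(columns):
    """Pack parallel column lists into one contiguous buffer, row by row."""
    buf = bytearray()
    for values in zip(*columns):
        buf += ROW.pack(*values)
    return bytes(buf)


def rows_to_columns(buf):
    """Unpack the row buffer back into one list per column."""
    rows = list(ROW.iter_unpack(buf))
    return [list(col) for col in zip(*rows)]


# Made-up sample values shaped like the columns of the benchmark query.
columns = [
    [b"N", b"R", b"A"],             # l_returnflag
    [b"O", b"F", b"F"],             # l_linestatus
    [17.0, 36.0, 8.0],              # l_quantity
    [21168.23, 45983.16, 13309.6],  # l_extendedprice
    [0.04, 0.09, 0.10],             # l_discount
    [0.02, 0.06, 0.02],             # l_tax
]

row_buf = columns_to_rows(columns)       # all fields of a row are adjacent
round_trip = rows_to_columns(row_buf)    # conversion back costs another pass
```

Each conversion is a full pass over the data, which is the extra computation; in exchange, all fields of one record sit next to each other in memory.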

yjshen · 2022-04-04T05:13:36Z

Hardware Settings:

H/W path                  Device          Class          Description
====================================================================
                                          system         MS-7D53 (To be filled by O.E.M.)
/0                                        bus            MPG X570S EDGE MAX WIFI (MS-7D53)
/0/0                                      memory         64KiB BIOS
/0/11                                     memory         32GiB System Memory
/0/11/0                                   memory         3600 MHz (0.3 ns) [empty]
/0/11/1                                   memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3 
/0/11/2                                   memory         3600 MHz (0.3 ns) [empty]
/0/11/3                                   memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3 
/0/14                                     memory         1MiB L1 cache
/0/15                                     memory         8MiB L2 cache
/0/16                                     memory         64MiB L3 cache
/0/17                                     processor      AMD Ryzen 9 5950X 16-Core Processor

A modified version of TPC-H q1, reduced to a sort over lineitem:

select
    l_returnflag,
    l_linestatus,
    l_quantity,
    l_extendedprice,
    l_discount,
    l_tax
from
    lineitem
order by
    l_extendedprice,
    l_discount;

yjshen · 2022-04-04T05:32:57Z

TPC-H SF=1

master:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "../../tpch-parquet/", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 2851.7 ms and returned 6001214 rows
Query 1 iteration 1 took 2817.7 ms and returned 6001214 rows
Query 1 iteration 2 took 2735.9 ms and returned 6001214 rows
Query 1 avg time: 2801.75 ms

This PR:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 3174.9 ms and returned 6001214 rows
Query 1 iteration 1 took 3130.8 ms and returned 6001214 rows
Query 1 iteration 2 took 3058.3 ms and returned 6001214 rows
Query 1 avg time: 3121.35 ms

The row format comes at the price of more computation: a ~11% performance regression is observed. Although this PR shows better cache locality, the computation cost outweighs the cache benefits:

sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096

master

 Performance counter stats for 'system wide':

       756,702,553      cache-misses              #   34.256 % of all cache refs    
     2,208,936,269      cache-references                                            
     1,156,898,644      l3_cache_accesses                                           
       362,860,081      l3_misses                                                   
       215,166,268      dTLB-load-misses          #   45.27% of all dTLB cache accesses
       475,312,480      dTLB-loads                                                  

       8.774750150 seconds time elapsed

This PR:

 Performance counter stats for 'system wide':

       593,785,538      cache-misses              #   25.841 % of all cache refs    
     2,297,807,480      cache-references                                            
       835,622,737      l3_cache_accesses                                           
       227,838,803      l3_misses                                                   
       146,442,785      dTLB-load-misses          #   55.76% of all dTLB cache accesses
       262,616,456      dTLB-loads                                                  

      10.556249442 seconds time elapsed

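Cache access behavior is much better with the row format: the overall cache-miss ratio drops from 34.3% to 25.8%, and L3 misses fall from about 363M to about 228M.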
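
As a rough cross-check, the miss ratios can be recomputed from the raw SF=1 counters above. This is only a sketch: the dictionary and function names are made up for illustration, and the counters are system-wide (`perf stat -a`), so they include other activity on the machine.

```python
# Counters copied from the SF=1 `perf stat` output above.
master = {
    "cache-misses": 756_702_553, "cache-references": 2_208_936_269,
    "l3_misses": 362_860_081, "l3_cache_accesses": 1_156_898_644,
    "dTLB-load-misses": 215_166_268, "dTLB-loads": 475_312_480,
}
this_pr = {
    "cache-misses": 593_785_538, "cache-references": 2_297_807_480,
    "l3_misses": 227_838_803, "l3_cache_accesses": 835_622_737,
    "dTLB-load-misses": 146_442_785, "dTLB-loads": 262_616_456,
}


def miss_ratio(stats, misses, total):
    """Misses as a percentage of total accesses for one counter pair."""
    return 100.0 * stats[misses] / stats[total]


pairs = [
    ("cache-misses", "cache-references"),
    ("l3_misses", "l3_cache_accesses"),
    ("dTLB-load-misses", "dTLB-loads"),
]
for misses, total in pairs:
    m = miss_ratio(master, misses, total)
    p = miss_ratio(this_pr, misses, total)
    print(f"{misses:>18}: master {m:.2f}%  this PR {p:.2f}%")
```

Note that the dTLB miss ratio is higher with this PR, while the absolute numbers of dTLB loads and dTLB load misses are both lower.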

yjshen · 2022-04-04T05:52:08Z

TPC-H SF=10

master

target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet-sf10 --format parquet --query 1 --batch-size 4096

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 47772.6 ms and returned 59986051 rows
Query 1 iteration 1 took 47899.2 ms and returned 59986051 rows
Query 1 iteration 2 took 48861.9 ms and returned 59986051 rows
Query 1 avg time: 48177.89 ms

This PR:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 38565.1 ms and returned 59986051 rows
Query 1 iteration 1 took 37786.0 ms and returned 59986051 rows
Query 1 iteration 2 took 37056.7 ms and returned 59986051 rows
Query 1 avg time: 37802.62 ms

Performance improves by ~21.5% this time. The better memory access pattern more than pays for the extra computation of the row <-> columnar conversion.
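
Both percentages (the regression at SF=1 and the improvement at SF=10) follow from the reported average times. A small sketch of the arithmetic, with a helper name chosen here for illustration:

```python
def relative_change(baseline_ms, candidate_ms):
    """Signed change of candidate vs. baseline, as a percentage of baseline."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Average times reported by the benchmark runs above (master vs. this PR).
sf1 = relative_change(2801.75, 3121.35)     # SF=1
sf10 = relative_change(48177.89, 37802.62)  # SF=10

print(f"SF=1: {sf1:+.1f}%  SF=10: {sf10:+.1f}%")
# prints "SF=1: +11.4%  SF=10: -21.5%"
```

A positive value means this PR is slower than master; a negative value means it is faster.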

master

 Performance counter stats for 'system wide':

     9,443,323,018      cache-misses              #   41.338 % of all cache refs    
    22,844,399,240      cache-references                                            
    14,787,052,560      l3_cache_accesses                                           
     5,753,820,101      l3_misses                                                   
     3,046,705,364      dTLB-load-misses          #   54.75% of all dTLB cache accesses
     5,565,251,257      dTLB-loads                                                  

     147.045336524 seconds time elapsed

This PR:

 Performance counter stats for 'system wide':

     6,750,648,518      cache-misses              #   30.344 % of all cache refs    
    22,247,021,905      cache-references                                            
    10,821,629,799      l3_cache_accesses                                           
     2,122,684,404      l3_misses                                                   
     2,348,824,410      dTLB-load-misses          #   64.09% of all dTLB cache accesses
     3,664,743,134      dTLB-loads                                                  

     115.306819499 seconds time elapsed

alamb · 2022-04-04T19:58:24Z

This PR looks amazing -- I can't wait to review it -- I plan to do so tomorrow. Thank you @yjshen

yjshen · 2022-04-05T04:44:08Z

Thanks @alamb! This PR acts more like a playground and testbed for using the row format in the sort's payload. It's not mature enough yet, but I've got some numbers we could discuss.

As the benchmarks above show, row <-> columnar batch conversion costs extra CPU computation, and that cost pays off when columnar memory access becomes more expensive on a bigger dataset.

Obviously, we need a clever/adaptive mechanism to choose which memory layout to employ based on the number of payload columns, the size of the input data, etc. So I'm posting the experimental implementation with some micro-benchmark results to gain insights from more brilliant minds.
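
One possible shape for such a mechanism, purely as a discussion aid: the sketch below is hypothetical and is not implemented in this PR. `SortInput`, `choose_layout`, `row_threshold_bytes`, and `min_payload_columns` are invented names, and the threshold values would have to come from benchmarking rather than from the numbers used here.

```python
from dataclasses import dataclass


@dataclass
class SortInput:
    num_rows: int
    payload_columns: int
    avg_value_bytes: int = 8  # assumed average width of one value


def choose_layout(inp, row_threshold_bytes, min_payload_columns=2):
    """Hypothetical heuristic: return "row" or "columnar" for the sort buffer."""
    # With very few payload columns, columnar gathers touch few arrays,
    # so the conversion cost is unlikely to pay off.
    if inp.payload_columns < min_payload_columns:
        return "columnar"
    # Estimate the payload size; only large inputs are assumed to benefit
    # enough from row locality to justify the extra conversion work.
    payload_bytes = inp.num_rows * inp.payload_columns * inp.avg_value_bytes
    return "row" if payload_bytes > row_threshold_bytes else "columnar"


# Placeholder threshold of 1 MiB, chosen only to exercise both branches.
small = choose_layout(SortInput(1_000, 6), row_threshold_bytes=1 << 20)
large = choose_layout(SortInput(10_000_000, 6), row_threshold_bytes=1 << 20)
narrow = choose_layout(SortInput(10_000_000, 1), row_threshold_bytes=1 << 20)
```

A real implementation would likely need more inputs (for example, data types and available memory), so this only illustrates the kind of decision being discussed.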

cc @Dandandan @houqp @tustvold you might be interested in this as well.

alamb · 2022-04-05T20:59:03Z

I didn't quite make it to this PR today, but I plan to do so tomorrow (I want to check it out locally and play around with it / make some flamegraphs)

alamb · 2022-04-07T20:20:10Z

As I have mentioned, I am very interested in this work but I have not yet found time to give it the deep study it deserves (I am especially interested in the profiling).

However, I have been super busy with other work items and keeping things going in IOx and DataFusion. I am not sure when I will have a chance to study this one carefully.

alamb · 2022-04-15T15:35:03Z

Marking as draft to make it clear this PR isn't waiting on review -- please feel free to mark it as ready for review if that analysis is not correct or if it is ready to review again

yjshen added 14 commits April 1, 2022 12:15

6f83571 wip
4497d11 workable version
971ded6 sort to bench
acff377 use sort2
1e120de Replace
ad361c1 revert unexpected changes
28f5f45 iter rework
e56919b minor
75c6912 prepare
6bf7b9d avoid evaluating sort columns twice
e63fad3 buggy on spill but runnable to bench
85b4d85 1
154ea0c unnecessary utf8 check
e14fea9 docs more

github-actions bot added the datafusion Changes in the datafusion crate label Apr 4, 2022

alamb mentioned this pull request Apr 5, 2022

Research performance improvements in N-way merging #2148

Closed

alamb mentioned this pull request Apr 6, 2022

Easier flamegraph / profiling support for datafusion benchmarks #2174

Closed

xudong963 added the performance Make DataFusion faster label Apr 9, 2022

alamb marked this pull request as draft April 15, 2022 15:35

yjshen self-assigned this May 1, 2022

yjshen mentioned this pull request May 8, 2022

[Epic]: Complete ROW Format (Missing features) #1861

Closed

37 tasks

andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022

yjshen closed this Nov 21, 2022
