
Buffer records in row format in memory for SortExec #2146

Closed
wants to merge 14 commits

Conversation

@yjshen (Member) commented Apr 4, 2022

Which issue does this PR close?

Closes #2151.

This PR is based on #2132.

Rationale for this change

Row format is more cache-friendly for "row logic" operators such as sort, at the cost of extra computation when it is incorporated into a vectorized engine like DataFusion (extra row <-> columnar conversions are required, of course).

This work explores using the row format as the in-memory record buffer layout, for benchmarking, testing, and discussion purposes. The row-format round trip is sketched below.
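
As a minimal sketch of that round trip (not the code in this PR), the snippet below uses the arrow-rs RowConverter API to convert columns to rows, sort the rows with plain byte comparison, and convert back to columns. It assumes a recent arrow-rs release where the arrow::row module is available; the column names and values are made up for illustration.

use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;
use arrow::row::{RowConverter, SortField};

fn main() -> Result<(), ArrowError> {
    // Two columns standing in for the sort key / payload.
    let price: ArrayRef = Arc::new(Int64Array::from(vec![30, 10, 20]));
    let flag: ArrayRef = Arc::new(StringArray::from(vec!["c", "a", "b"]));

    // Describe the columns and build a converter.
    let converter = RowConverter::new(vec![
        SortField::new(DataType::Int64),
        SortField::new(DataType::Utf8),
    ])?;

    // Columnar -> row: each logical row becomes one contiguous byte slice.
    let rows = converter.convert_columns(&[price, flag])?;

    // Rows compare as raw bytes, so sorting needs no per-column dispatch.
    let mut sorted: Vec<_> = rows.iter().collect();
    sorted.sort();

    // Row -> columnar: rebuild Arrow arrays in the sorted order.
    let arrays = converter.convert_rows(sorted)?;
    println!("{:?}", arrays);
    Ok(())
}
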

@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Apr 4, 2022
@yjshen (Member, Author) commented Apr 4, 2022

Hardware Settings:

H/W path                  Device          Class          Description
====================================================================
                                          system         MS-7D53 (To be filled by O.E.M.)
/0                                        bus            MPG X570S EDGE MAX WIFI (MS-7D53)
/0/0                                      memory         64KiB BIOS
/0/11                                     memory         32GiB System Memory
/0/11/0                                   memory         3600 MHz (0.3 ns) [empty]
/0/11/1                                   memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3 
/0/11/2                                   memory         3600 MHz (0.3 ns) [empty]
/0/11/3                                   memory         16GiB DIMM DDR4 Synchronous Unbuffered (Unregistered) 3600 MHz (0.3 
/0/14                                     memory         1MiB L1 cache
/0/15                                     memory         8MiB L2 cache
/0/16                                     memory         64MiB L3 cache
/0/17                                     processor      AMD Ryzen 9 5950X 16-Core Processor

A modified version of TPC-H q1:

select
    l_returnflag,
    l_linestatus,
    l_quantity,
    l_extendedprice,
    l_discount,
    l_tax
from
    lineitem
order by
    l_extendedprice,
    l_discount;

@yjshen (Member, Author) commented Apr 4, 2022

TPC-H SF=1

master:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "../../tpch-parquet/", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 2851.7 ms and returned 6001214 rows
Query 1 iteration 1 took 2817.7 ms and returned 6001214 rows
Query 1 iteration 2 took 2735.9 ms and returned 6001214 rows
Query 1 avg time: 2801.75 ms

This PR:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 3174.9 ms and returned 6001214 rows
Query 1 iteration 1 took 3130.8 ms and returned 6001214 rows
Query 1 iteration 2 took 3058.3 ms and returned 6001214 rows
Query 1 avg time: 3121.35 ms

The row format comes at the price of extra computation, and a ~11% performance regression is observed here. Although this PR shows better cache locality, the computation cost outweighs the cache benefits:

sudo perf stat -a -e cache-misses,cache-references,l3_cache_accesses,l3_misses,dTLB-load-misses,dTLB-loads target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet --format parquet --query 1 --batch-size 4096

master

 Performance counter stats for 'system wide':

       756,702,553      cache-misses              #   34.256 % of all cache refs    
     2,208,936,269      cache-references                                            
     1,156,898,644      l3_cache_accesses                                           
       362,860,081      l3_misses                                                   
       215,166,268      dTLB-load-misses          #   45.27% of all dTLB cache accesses
       475,312,480      dTLB-loads                                                  

       8.774750150 seconds time elapsed

This PR:

 Performance counter stats for 'system wide':

       593,785,538      cache-misses              #   25.841 % of all cache refs    
     2,297,807,480      cache-references                                            
       835,622,737      l3_cache_accesses                                           
       227,838,803      l3_misses                                                   
       146,442,785      dTLB-load-misses          #   55.76% of all dTLB cache accesses
       262,616,456      dTLB-loads                                                  

      10.556249442 seconds time elapsed

Cache access behavior is much better with the row format.

@yjshen (Member, Author) commented Apr 4, 2022

TPC-H SF=10

master

target/release/tpch benchmark datafusion --iterations 3 --path /home/yijie/sort_test/tpch-parquet-sf10 --format parquet --query 1 --batch-size 4096

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 47772.6 ms and returned 59986051 rows
Query 1 iteration 1 took 47899.2 ms and returned 59986051 rows
Query 1 iteration 2 took 48861.9 ms and returned 59986051 rows
Query 1 avg time: 48177.89 ms

This PR:

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "/home/yijie/sort_test/tpch-parquet-sf10", file_format: "parquet", mem_table: false, output_path: None }
Query 1 iteration 0 took 38565.1 ms and returned 59986051 rows
Query 1 iteration 1 took 37786.0 ms and returned 59986051 rows
Query 1 iteration 2 took 37056.7 ms and returned 59986051 rows
Query 1 avg time: 37802.62 ms

Performance improved by ~21.5% this time. The advantage of the better memory access pattern outweighs the extra computation for the row <-> columnar conversion.

master

 Performance counter stats for 'system wide':

     9,443,323,018      cache-misses              #   41.338 % of all cache refs    
    22,844,399,240      cache-references                                            
    14,787,052,560      l3_cache_accesses                                           
     5,753,820,101      l3_misses                                                   
     3,046,705,364      dTLB-load-misses          #   54.75% of all dTLB cache accesses
     5,565,251,257      dTLB-loads                                                  

     147.045336524 seconds time elapsed

This PR:

 Performance counter stats for 'system wide':

     6,750,648,518      cache-misses              #   30.344 % of all cache refs    
    22,247,021,905      cache-references                                            
    10,821,629,799      l3_cache_accesses                                           
     2,122,684,404      l3_misses                                                   
     2,348,824,410      dTLB-load-misses          #   64.09% of all dTLB cache accesses
     3,664,743,134      dTLB-loads                                                  

     115.306819499 seconds time elapsed

@alamb (Contributor) commented Apr 4, 2022

This PR looks amazing -- I can't wait to review it -- I plan to do so tomorrow. Thank you @yjshen

@yjshen (Member, Author) commented Apr 5, 2022

Thanks @alamb! This PR acts more as a playground and testbed for using the row format for the sort's payload. It's not mature yet, but I have some numbers we could discuss.

As the benchmarks above show, the row <-> columnar conversion comes at the price of extra CPU computation, and that cost only pays off when columnar memory access becomes more expensive on a larger dataset.

Clearly, we need a clever/adaptive mechanism to choose which memory layout to employ based on the number of payload columns, the size of the input data, etc. So I'm posting this experimental implementation with some micro-benchmark results to gather insights from more brilliant minds; a rough sketch of such a heuristic follows.
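
For illustration only, here is what such a heuristic could look like. The function name, the 256 MiB size threshold, and the 2-column threshold are all invented for the example and would have to be derived from benchmarks like the ones above; this is not part of this PR.

/// Hypothetical choice of in-memory buffer layout for SortExec.
enum SortBufferLayout {
    Columnar,
    Row,
}

/// Pick a layout from coarse properties of the input.
/// The thresholds below are placeholders, not measured values.
fn choose_layout(num_payload_columns: usize, estimated_input_bytes: usize) -> SortBufferLayout {
    const SMALL_INPUT_BYTES: usize = 256 * 1024 * 1024;
    const FEW_COLUMNS: usize = 2;

    if estimated_input_bytes < SMALL_INPUT_BYTES || num_payload_columns <= FEW_COLUMNS {
        // Small or narrow inputs: the row <-> columnar conversion cost
        // dominates, so keep the columnar batches (the SF=1 case above).
        SortBufferLayout::Columnar
    } else {
        // Large, wide inputs: the row format's better cache behavior
        // pays for the conversion (the SF=10 case above).
        SortBufferLayout::Row
    }
}
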

cc @Dandandan @houqp @tustvold you might be interested in this as well.

@alamb (Contributor) commented Apr 5, 2022

I didn't quite make it to this PR today, but I plan to do so tomorrow (I want to check it out locally and play around with it / make some flamegraphs)

@alamb (Contributor) commented Apr 7, 2022

As I have mentioned, I am very interested in this work but I have not yet found time to give it the deep study it deserves (I am especially interested in the profiling).

However, I have been super busy with other work items and keeping things going in IOx and DataFusion. I am not sure when I will have a chance to study this one carefully.

@xudong963 xudong963 added the performance Make DataFusion faster label Apr 9, 2022
@alamb (Contributor) commented Apr 15, 2022

Marking as draft to make it clear this PR isn't waiting on review -- please feel free to mark it as ready for review if that analysis is not correct or if it is ready to review again.

@alamb alamb marked this pull request as draft April 15, 2022 15:35
@yjshen yjshen self-assigned this May 1, 2022
@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
@yjshen yjshen closed this Nov 21, 2022
Labels: performance (Make DataFusion faster)
Development: Successfully merging this pull request may close the issue "Research usage of row format for sort records buffering"
4 participants