Improved performance of sum aggregation via aligned loads (-10%) #445
Updated from a236c47 to de512f2.
Codecov Report
Coverage Diff
            main      #445      +/-
Coverage    79.89%    80.02%    +0.13%
Files       371       371
Lines       22776     22841     +65
Hits        18197     18279     +82
Misses      4579      4562      -17
Interesting. Left some ideas.
Updated from de512f2 to 4bf5c3e.
An update on the benchmarks:
Gnuplot not found, using plotters backend
sum 2^10 f64 time: [151.72 ns 152.40 ns 153.21 ns]
change: [-3.5650% -3.1521% -2.7679%] (p = 0.00 < 0.05)
Performance has improved.
sum 2^10 i64 time: [113.88 ns 113.91 ns 113.94 ns]
change: [+4.6575% +4.7503% +4.8275%] (p = 0.00 < 0.05)
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) high mild
6 (6.00%) high severe
sum 2^12 f64 time: [526.64 ns 527.46 ns 528.31 ns]
change: [-2.1437% -1.9416% -1.7502%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
sum 2^12 i64 time: [348.69 ns 348.94 ns 349.20 ns]
change: [-22.641% -21.575% -20.855%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
sum 2^14 f64 time: [1.9927 us 1.9929 us 1.9930 us]
change: [-3.1503% -2.7050% -2.3456%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
sum 2^14 i64 time: [1.7309 us 1.7317 us 1.7325 us]
change: [-8.9944% -8.9284% -8.8592%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
sum 2^16 f64 time: [8.3187 us 8.3202 us 8.3222 us]
change: [-5.6838% -5.3678% -5.0664%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
6 (6.00%) low severe
3 (3.00%) high mild
6 (6.00%) high severe
sum 2^16 i64 time: [9.2400 us 9.2499 us 9.2608 us]
change: [-7.9995% -7.8217% -7.6153%] (p = 0.00 < 0.05)
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) low severe
2 (2.00%) low mild
3 (3.00%) high mild
6 (6.00%) high severe
sum 2^18 f64 time: [35.474 us 35.487 us 35.504 us]
change: [-9.2037% -9.1425% -9.0746%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
sum 2^18 i64 time: [35.166 us 35.174 us 35.182 us]
change: [-10.221% -10.140% -10.061%] (p = 0.00 < 0.05)
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
7 (7.00%) high mild
1 (1.00%) high severe
sum 2^20 f64 time: [188.31 us 188.55 us 188.80 us]
change: [-2.2460% -1.7823% -1.2743%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
5 (5.00%) high mild
3 (3.00%) high severe
sum 2^20 i64 time: [207.85 us 209.90 us 212.52 us]
change: [-8.0797% -6.7231% -5.2968%] (p = 0.00 < 0.05)
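Performance has improved.
The improvement seems fairly consistent, mostly in the ~2-10% range (up to ~22% for sum 2^12 i64); the one exception is sum 2^10 i64, which regressed by ~4.8%. I still have to find a common set of traits that both the packed_simd and the native implementations can share.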
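A trait shared by both backends might look roughly like the following sketch. It assumes each backend only needs a lane count, a zero vector, a load from a lane-sized chunk, lane-wise addition, and a horizontal sum. `SumLanes`, `NativeF64x8` and `sum_generic` are hypothetical names, not arrow2's actual API. Only the native backend is shown because packed_simd is an external, nightly-only crate; a packed_simd vector such as `f64x8` could implement the same trait behind a feature flag.

```rust
use std::ops::Add;

/// Hypothetical trait abstracting over a SIMD backend (e.g. packed_simd)
/// and a portable "native" fallback that relies on auto-vectorization.
pub trait SumLanes: Copy + Add<Output = Self> {
    /// Scalar element type stored in each lane.
    type Scalar: Copy + Add<Output = Self::Scalar>;
    /// Number of lanes in one vector.
    const LANES: usize;
    /// Vector with every lane set to zero.
    fn zero() -> Self;
    /// Loads exactly `LANES` scalars from `chunk`.
    fn from_chunk(chunk: &[Self::Scalar]) -> Self;
    /// Horizontal sum of all lanes.
    fn sum_lanes(self) -> Self::Scalar;
}

/// Hypothetical native backend: a plain array the compiler can vectorize.
#[derive(Clone, Copy, Debug)]
pub struct NativeF64x8([f64; 8]);

impl Add for NativeF64x8 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        let mut out = self.0;
        for (a, b) in out.iter_mut().zip(rhs.0.iter()) {
            *a += *b;
        }
        NativeF64x8(out)
    }
}

impl SumLanes for NativeF64x8 {
    type Scalar = f64;
    const LANES: usize = 8;
    fn zero() -> Self {
        NativeF64x8([0.0; 8])
    }
    fn from_chunk(chunk: &[f64]) -> Self {
        let mut lanes = [0.0; 8];
        lanes.copy_from_slice(chunk); // panics unless chunk.len() == 8
        NativeF64x8(lanes)
    }
    fn sum_lanes(self) -> f64 {
        self.0.iter().sum()
    }
}

/// Generic sum written once against the trait: full chunks go through the
/// vector type, the remainder is added one scalar at a time.
pub fn sum_generic<V: SumLanes>(values: &[V::Scalar]) -> V::Scalar {
    let chunks = values.chunks_exact(V::LANES);
    let remainder = chunks.remainder();
    let acc = chunks.fold(V::zero(), |acc, chunk| acc + V::from_chunk(chunk));
    remainder.iter().fold(acc.sum_lanes(), |s, &x| s + x)
}
```

With this shape, the aggregation kernel is written once and each backend only has to provide the handful of primitives in the trait.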
Updated from 4bf5c3e to 90b4f85.
Looks good. Left a small comment that, imo, simplifies the code further.
This adds aligned load instructions to the SIMD sum aggregation of arrays without null values. The implementation works with any buffer alignment: it does not require an aligned allocator, and it also works on sliced arrays.
Benchmarks are a bit mixed on small data sizes, but they seem to improve with more data. I want to run them on larger inputs later.
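One way to use aligned loads without an aligned allocator is sketched below, assuming the approach of splitting the (possibly sliced) buffer into an unaligned head, an aligned body, and an unaligned tail; this is not necessarily how the PR implements it. The split uses the standard library's `slice::align_to`. `Lanes8` and `sum_aligned` are hypothetical names: `Lanes8` is a 64-byte-aligned stand-in for a SIMD vector whose alignment lets the compiler emit aligned vector loads for the body.

```rust
/// Hypothetical stand-in for a SIMD register: 8 f64 lanes, 64-byte aligned.
#[derive(Clone, Copy)]
#[repr(C, align(64))]
struct Lanes8([f64; 8]);

/// Sums `values` using aligned 8-lane chunks for the aligned middle part.
/// Works for any starting address, e.g. a slice at an arbitrary offset.
fn sum_aligned(values: &[f64]) -> f64 {
    // SAFETY: every bit pattern is a valid f64 and Lanes8 is exactly eight
    // f64 values (size 64, no padding), so reinterpreting aligned runs of
    // eight f64 as Lanes8 is sound. `align_to` may put everything in `head`
    // or `tail`; that only affects speed, not the result.
    let (head, body, tail) = unsafe { values.align_to::<Lanes8>() };

    // Lane-wise accumulation over the aligned chunks.
    let mut acc = [0.0f64; 8];
    for chunk in body {
        for (a, v) in acc.iter_mut().zip(chunk.0.iter()) {
            *a += *v;
        }
    }
    let body_sum: f64 = acc.iter().sum();

    // Unaligned prefix and suffix are summed as plain scalars.
    let head_sum: f64 = head.iter().sum();
    let tail_sum: f64 = tail.iter().sum();
    head_sum + body_sum + tail_sum
}
```

Slicing an array only changes how many elements land in `head` and `tail`, which is why this kind of split also handles sliced arrays. Note that accumulating in lanes changes the order of floating-point additions, so f64 results can differ in the last bits from a sequential sum.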