Explicitly stage strided loads #7230

Merged: 29 commits into main from abadams/stage_strided_loads, Dec 16, 2022

Conversation


@abadams (Member) commented Dec 12, 2022:

This PR adds a new compiler pass that converts strided loads into dense loads followed by shuffles.

For a stride of two, the trick is to do a dense load of twice the size and then extract either the even or the odd lanes. This was previously done in codegen, where it was challenging: it's hard to tell there whether the double-sized load is safe, because it reads either one element beyond or one element before the original load. We used the alignment of the ramp base to try to tell whether it was safe to shift the load backwards, and we added padding to internal allocations so that, for those at least, it was safe to shift forwards. Unfortunately, the alignment of the ramp base is usually unknown if you don't know anything about the strides of the input, and adding padding to allocations was a serious wart in our memory allocators.
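
A rough scalar sketch of the stride-two trick (illustrative only; the function name is invented and real codegen emits vector loads and shuffles rather than loops):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: a stride-2 load of n lanes served by one dense load of 2*n
// elements plus a lane extraction. The dense load touches one element beyond
// the strided load's footprint when taking the even lanes, or one element
// before it when taking the odd lanes -- which is exactly why we need to know
// the wider read is safe.
std::vector<int> load_stride2_via_dense(const int *base, size_t n, bool odd_lanes) {
    std::vector<int> dense(base, base + 2 * n);  // dense load, twice the size
    std::vector<int> out(n);
    for (size_t i = 0; i < n; i++) {
        out[i] = dense[2 * i + (odd_lanes ? 1 : 0)];  // extract even or odd lanes
    }
    return out;
}
```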

This PR instead actively looks for evidence elsewhere in the Stmt (at some location that definitely executes whenever the load being transformed executes) that it's safe to read further forwards or backwards in memory. The evidence takes the form of a load at the same base address with a different constant offset. The pass also clusters groups of these loads so that they each perform the same dense load and extract their own slice of lanes. For loads from external buffers it just does a shorter load, and for loads from internal allocations that can't be shifted forwards or backwards it pads the allocation explicitly, via a new padding field on Allocate nodes.
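
A similarly hedged sketch of the clustering idea (again with invented names, scalar code standing in for vector shuffles): two stride-2 loads from the same base at constant offsets 0 and 1 together cover a dense range, so one shared dense load serves both without reading outside their combined footprint.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch only: cluster the even-lane and odd-lane stride-2 loads onto one
// dense load of 2*n elements. The combined footprint is exactly [0, 2*n),
// so no padding or out-of-bounds evidence is needed for the pair.
std::pair<std::vector<int>, std::vector<int>>
load_stride2_cluster(const int *base, size_t n) {
    std::vector<int> dense(base, base + 2 * n);  // one shared dense load
    std::vector<int> even(n), odd(n);
    for (size_t i = 0; i < n; i++) {
        even[i] = dense[2 * i];      // slice of lanes for the load at offset 0
        odd[i] = dense[2 * i + 1];   // slice of lanes for the load at offset 1
    }
    return {even, odd};
}
```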

Edit: Some backends don't like non-power-of-two sized vectors, so for loads from external buffers, we now split the load into two overlapping dense loads of half the size, then shuffle out appropriate lanes from each and concat the results.
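
For the external-buffer case, a hedged sketch of the half-sized-load strategy (invented name; n assumed even to keep the lane split simple): a stride-2 load of n lanes touches 2*n - 1 contiguous elements, so two overlapping dense loads of n lanes each cover exactly that span, and the wanted lanes are shuffled out of each half and concatenated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: serve a stride-2 load of n lanes from base with two
// overlapping n-lane dense loads. Both loads stay inside the original
// footprint [0, 2*n - 1), and both have a "nice" power-of-two width when n does.
std::vector<int> load_stride2_external(const int *base, size_t n) {
    assert(n % 2 == 0);  // simplifying assumption for this sketch
    std::vector<int> lo(base, base + n);                  // covers [0, n)
    std::vector<int> hi(base + n - 1, base + 2 * n - 1);  // covers [n-1, 2n-1)
    std::vector<int> out(n);
    for (size_t i = 0; i < n / 2; i++) {
        out[i] = lo[2 * i];              // even lanes from the low half
        out[n / 2 + i] = hi[2 * i + 1];  // remaining even lanes from the high half
    }
    return out;
}
```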

@steven-johnson (Contributor) commented:

The excellent PR description of what's going on now (and what we used to do) should be captured in a code comment somewhere.

src/CodeGen_ARM.cpp (review thread, resolved)
@abadams changed the title from "Explicitly stage strided loads (WIP)" to "Explicitly stage strided loads" on Dec 13, 2022
@steven-johnson (Contributor) left a comment:

LGTM -- don't see any red flags inside Google.

src/CodeGen_ARM.cpp (review thread, resolved)
}
// TODO: We do not yet handle nested vectorization here for
// ramps which have not already collapsed. We could potentially
// handle more interesting types of shuffle than simple flat slices.
A Member left a comment:

Should this TODO be an open issue?

// The loop body definitely runs

// TODO: worry about different iterations of the loop somehow not
// providing the evidence we thought it did.
A Member left a comment:

(same as above) should this TODO be an issue? Or a blocker for this PR? I'm also not quite sure what the TODO means.

@rootjalex (Member) left a comment:

Lgtm with nits

@steven-johnson steven-johnson merged commit 10345d4 into main Dec 16, 2022
@steven-johnson steven-johnson deleted the abadams/stage_strided_loads branch December 16, 2022 17:56
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request on Mar 3, 2024:
* Add a pass to do explicit densification of strided loads

* densify more types of strided load

* Reorder downsample in local laplacian for slightly better performance

* Move allocation padding into the IR. Still WIP.

* Simplify concat_bits handling

* Use evidence from parent scopes to densify

* Disallow padding allocations with custom new expressions

* Add test for parent scopes

* Remove debugging prints. Avoid nested ramps.

* Avoid parent scope loops

* Update cmakefiles

* Fix for large_buffers

* Pad stack allocations too

* Restore vld2/3/4 generation on non-Apple ARM chips

* Appease clang-format and clang-tidy

* Silence clang-tidy

* Better comments

* Comment improvements

* Nuke code that reads out of bounds

* Fix stage_strided_loads test

* Change strategy for loads from external buffers

Some backends don't like non-power-of-two vectors. Do two overlapping
half-sized loads and shuffle instead of one funny-sized load.

* Add explanatory comment to ARM backend

* Fix cpp backend shuffling

* Fix missing msan annotations

* Magnify heap cost effect in stack_vs_heap performance test

* Address review comments

* clang-tidy

* Fix for when same load node occurs in two different allocate nodes