Explicitly stage strided loads #7230

Merged: 29 commits into main from abadams/stage_strided_loads, Dec 16, 2022

Conversation


@abadams (Member) commented Dec 12, 2022:

This PR adds a new compiler pass that converts strided loads into dense loads followed by shuffles.

For a stride of two, the trick is to do a dense load of twice the size and then extract either the even or the odd lanes. This was previously done in codegen, where it was challenging: it's hard to tell there whether the double-sized load is safe, because it reads either one element beyond or one element before the original load. We used the alignment of the ramp base to try to tell whether it was safe to shift the load backwards, and we added padding to internal allocations so that, for those at least, it was safe to shift forwards. Unfortunately, the alignment of the ramp base is usually unknown if you don't know anything about the strides of the input, and adding padding to allocations was a serious wart in our memory allocators.
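
A rough scalar sketch of the stride-two trick (illustrative only; the function name is invented and real codegen emits vector loads and shuffles rather than loops):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: a stride-2 load of n lanes served by one dense load of 2*n
// elements plus a lane extraction. The dense load touches one element beyond
// the strided load's footprint when taking the even lanes, or one element
// before it when taking the odd lanes -- which is exactly why we need to know
// the wider read is safe.
std::vector<int> load_stride2_via_dense(const int *base, size_t n, bool odd_lanes) {
    std::vector<int> dense(base, base + 2 * n);  // dense load, twice the size
    std::vector<int> out(n);
    for (size_t i = 0; i < n; i++) {
        out[i] = dense[2 * i + (odd_lanes ? 1 : 0)];  // extract even or odd lanes
    }
    return out;
}
```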

This PR instead actively looks for evidence elsewhere in the Stmt (at some location that definitely executes whenever the load being transformed executes) that it's safe to read further forwards or backwards in memory. The evidence takes the form of a load at the same base address with a different constant offset. The pass also clusters groups of these loads so that they each perform the same dense load and extract their own slice of lanes. For loads from external buffers it just does a shorter load, and for loads from internal allocations that can't be shifted forwards or backwards it pads the allocation explicitly, via a new padding field on Allocate nodes.
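
A similarly hedged sketch of the clustering idea (again with invented names, scalar code standing in for vector shuffles): two stride-2 loads from the same base at constant offsets 0 and 1 together cover a dense range, so one shared dense load serves both without reading outside their combined footprint.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch only: cluster the even-lane and odd-lane stride-2 loads onto one
// dense load of 2*n elements. The combined footprint is exactly [0, 2*n),
// so no padding or out-of-bounds evidence is needed for the pair.
std::pair<std::vector<int>, std::vector<int>>
load_stride2_cluster(const int *base, size_t n) {
    std::vector<int> dense(base, base + 2 * n);  // one shared dense load
    std::vector<int> even(n), odd(n);
    for (size_t i = 0; i < n; i++) {
        even[i] = dense[2 * i];      // slice of lanes for the load at offset 0
        odd[i] = dense[2 * i + 1];   // slice of lanes for the load at offset 1
    }
    return {even, odd};
}
```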

Edit: Some backends don't like non-power-of-two sized vectors, so for loads from external buffers, we now split the load into two overlapping dense loads of half the size, then shuffle out appropriate lanes from each and concat the results.
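
For the external-buffer case, a hedged sketch of the half-sized-load strategy (invented name; n assumed even to keep the lane split simple): a stride-2 load of n lanes touches 2*n - 1 contiguous elements, so two overlapping dense loads of n lanes each cover exactly that span, and the wanted lanes are shuffled out of each half and concatenated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: serve a stride-2 load of n lanes from base with two
// overlapping n-lane dense loads. Both loads stay inside the original
// footprint [0, 2*n - 1), and both have a "nice" power-of-two width when n does.
std::vector<int> load_stride2_external(const int *base, size_t n) {
    assert(n % 2 == 0);  // simplifying assumption for this sketch
    std::vector<int> lo(base, base + n);                  // covers [0, n)
    std::vector<int> hi(base + n - 1, base + 2 * n - 1);  // covers [n-1, 2n-1)
    std::vector<int> out(n);
    for (size_t i = 0; i < n / 2; i++) {
        out[i] = lo[2 * i];              // even lanes from the low half
        out[n / 2 + i] = hi[2 * i + 1];  // remaining even lanes from the high half
    }
    return out;
}
```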

@steven-johnson (Contributor) commented:

The excellent PR description of what's going on now (and what we used to do) should be captured in a code comment somewhere.

src/CodeGen_ARM.cpp (review thread, resolved)
@abadams changed the title from "Explicitly stage strided loads (WIP)" to "Explicitly stage strided loads" on Dec 13, 2022
@steven-johnson (Contributor) left a comment:

LGTM -- don't see any red flags inside Google.

src/CodeGen_ARM.cpp (review thread, resolved)
}
// TODO: We do not yet handle nested vectorization here for
// ramps which have not already collapsed. We could potentially
// handle more interesting types of shuffle than simple flat slices.
A Member left a comment:

Should this TODO be an open issue?

// The loop body definitely runs

// TODO: worry about different iterations of the loop somehow not
// providing the evidence we thought it did.
A Member left a comment:

(same as above) should this TODO be an issue? Or a blocker for this PR? I'm also not quite sure what the TODO means.

@rootjalex (Member) left a comment:

Lgtm with nits

@steven-johnson steven-johnson merged commit 10345d4 into main Dec 16, 2022
@steven-johnson steven-johnson deleted the abadams/stage_strided_loads branch December 16, 2022 17:56
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request on Mar 3, 2024:
* Add a pass to do explicit densification of strided loads

* densify more types of strided load

* Reorder downsample in local laplacian for slightly better performance

* Move allocation padding into the IR. Still WIP.

* Simplify concat_bits handling

* Use evidence from parent scopes to densify

* Disallow padding allocations with custom new expressions

* Add test for parent scopes

* Remove debugging prints. Avoid nested ramps.

* Avoid parent scope loops

* Update cmakefiles

* Fix for large_buffers

* Pad stack allocations too

* Restore vld2/3/4 generation on non-Apple ARM chips

* Appease clang-format and clang-tidy

* Silence clang-tidy

* Better comments

* Comment improvements

* Nuke code that reads out of bounds

* Fix stage_strided_loads test

* Change strategy for loads from external buffers

Some backends don't like non-power-of-two vectors. Do two overlapping
half-sized loads and shuffle instead of one funny-sized load.

* Add explanatory comment to ARM backend

* Fix cpp backend shuffling

* Fix missing msan annotations

* Magnify heap cost effect in stack_vs_heap performance test

* Address review comments

* clang-tidy

* Fix for when same load node occurs in two different allocate nodes