saucecontrol
Member

@saucecontrol saucecontrol commented Jul 14, 2025

This enables embedded broadcast of non-const values in Tier0.

Diffs are a net improvement, although there are a few regressions where an extra temp ends up being introduced due to arg swapping.

There are also a few 1- or 2-byte regressions where we swapped from containing a full vector load arg to containing a broadcast arg, which then forces EVEX encoding. It would be interesting to look at optimizing around that (separately, since it would impact FullOpts as well).

@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 14, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 14, 2025
@saucecontrol
Member Author

cc @tannergooding

@tannergooding
Member

> There are also a few 1- or 2-byte regressions where we swapped from containing a full vector load arg to containing a broadcast arg

We view this as an explicit improvement and the real "issue" is more that SPMI doesn't surface any size savings in the data section size. -- That is, while the codegen is 1-2 bytes bigger, we save 8-60 bytes of data section size and improve cache locality.
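
As a rough illustration of the data section trade-off described above: assuming vector constants of 16, 32, or 64 bytes and broadcast elements of 4 or 8 bytes (an assumption; the comment does not spell out the sizes), the saving is simply the vector size minus the element size, which lines up with the 8-60 byte range. `DataBytesSaved` is a hypothetical helper for this sketch, not a JIT function, and alignment padding is ignored.

```cpp
#include <cstddef>

// Hypothetical helper (not a JIT API): bytes of constant data saved when a
// full-width vector constant is replaced by a single element that the
// instruction broadcasts at execution time.
constexpr std::size_t DataBytesSaved(std::size_t vectorBytes, std::size_t elementBytes)
{
    return vectorBytes - elementBytes;
}

// Smallest assumed case: 16-byte vector holding 8-byte elements.
static_assert(DataBytesSaved(16, 8) == 8, "16-byte vector, 8-byte element");

// Largest assumed case: 64-byte vector holding 4-byte elements.
static_assert(DataBytesSaved(64, 4) == 60, "64-byte vector, 4-byte element");
```

Under these assumptions, a 1-2 byte increase in code size is outweighed by the data saved for every vector and element size in the range.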

@saucecontrol
Member Author

> We view this as an explicit improvement and the real "issue" is more that SPMI doesn't surface any size savings in the data section size. -- That is, while the codegen is 1-2 bytes bigger, we save 8-60 bytes of data section size and improve cache locality.

The cases I'm referring to are like this:
image

where it's a broadcast either way, and we can contain either the broadcast or the full vector. It's always 2 instructions because they can't both be contained. Switching from containing the full vector to containing the broadcast means you have to switch to EVEX, so it's a net increase in size.
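
For context on the size increase: in the standard x86 encodings a VEX prefix occupies 2 or 3 bytes and an EVEX prefix always occupies 4, which is consistent with the 1- or 2-byte regressions mentioned earlier. A minimal sketch, assuming the rest of the instruction encodes identically in both forms (the names below are illustrative, not emitter APIs):

```cpp
// Prefix lengths in bytes for the x86 vector encodings.
constexpr int kVex2PrefixBytes = 2; // two-byte VEX form (C5)
constexpr int kVex3PrefixBytes = 3; // three-byte VEX form (C4)
constexpr int kEvexPrefixBytes = 4; // EVEX form (62), needed for embedded broadcast

// Extra bytes paid when an instruction moves from VEX to EVEX,
// assuming opcode, ModRM, and displacement are unchanged.
constexpr int EvexGrowth(int vexPrefixBytes)
{
    return kEvexPrefixBytes - vexPrefixBytes;
}

static_assert(EvexGrowth(kVex2PrefixBytes) == 2, "two-byte VEX grows by 2");
static_assert(EvexGrowth(kVex3PrefixBytes) == 1, "three-byte VEX grows by 1");
```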

This particular regression only applies to instructions where we swap operands in order to be able to contain one, so I think we could simply give lower preference to CnsVec operands that might be turned into broadcast. Or something like that?

@tannergooding
Member

> This particular regression only applies to instructions where we swap operands in order to be able to contain one, so I think we could simply give lower preference to CnsVec operands that might be turned into broadcast. Or something like that?

Ah, I see.

Yeah, in general we want to prefer loads from arbitrary memory, then broadcastable constants, then regular constants.
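
The preference order above could be sketched roughly as follows. This is not the JIT's actual containment logic; `OperandKind`, `ContainmentRank`, and `ShouldSwapToContain` are hypothetical names, and the sketch assumes a commutative instruction in which only the second operand can be contained.

```cpp
// Hypothetical operand categories, declared from most to least preferred
// for containment, following the order stated in the comment above.
enum class OperandKind
{
    MemoryLoad,            // load from arbitrary memory
    BroadcastableConstant, // constant vector that could become an embedded broadcast
    RegularConstant,       // any other constant vector
    NotContainable         // must stay in a register
};

// Lower rank means more desirable to contain.
constexpr int ContainmentRank(OperandKind kind)
{
    return static_cast<int>(kind);
}

// Assumes only op2 can be contained. Returns true when op1 is strictly
// preferred, so swapping the operands lets the better candidate be contained.
// On a tie the operands stay in place, which avoids a needless swap.
constexpr bool ShouldSwapToContain(OperandKind op1, OperandKind op2)
{
    return ContainmentRank(op1) < ContainmentRank(op2);
}

// A memory load in op1 beats a broadcastable constant in op2: swap.
static_assert(ShouldSwapToContain(OperandKind::MemoryLoad,
                                  OperandKind::BroadcastableConstant),
              "memory load is preferred over a broadcastable constant");

// A regular constant in op1 does not beat a broadcastable constant in op2: keep.
static_assert(!ShouldSwapToContain(OperandKind::RegularConstant,
                                   OperandKind::BroadcastableConstant),
              "broadcastable constant is preferred over a regular constant");
```

Using a strict comparison means equal-ranked operands are never swapped, which matters here because swapping was the source of the extra temps noted in the PR description.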

@saucecontrol
Member Author

Disabled the aligned load containment. Diffs are smaller but still a net improvement.

@saucecontrol
Member Author

I've split the TryFoldCnsVecForEmbeddedBroadcast changes out into #117700.

Member

@tannergooding tannergooding left a comment

LGTM. CC @dotnet/jit-contrib for secondary review.

@tannergooding
Member

/ba-g unrelated arm64 timeouts

@tannergooding tannergooding merged commit 0b2f272 into dotnet:main Jul 22, 2025
102 of 110 checks passed
@saucecontrol saucecontrol deleted the more-t0-opts branch July 22, 2025 04:09
@github-actions github-actions bot locked and limited conversation to collaborators Aug 21, 2025