Enable NI_Vector256_Create on AVX1 #72522

EgorBo · 2022-07-20T13:22:47Z

ghost · 2022-07-20T13:23:03Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	EgorBo
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

src/coreclr/jit/hwintrinsic.cpp

EgorBo · 2022-07-20T15:37:23Z

@tannergooding PTAL

tannergooding · 2022-07-20T16:11:44Z

An alternative and likely better fix would to just remove or update the recursive path from the Create calls. Such a path is only used for indirect invocation (reflection, delegates, function pointers, etc) and its not really worth optimizing them when the software fallback already does the right thing. Additionally, it simplifies the handling and reduces risk of other errors due to future ISA or implementation changes.

EgorBo · 2022-07-20T16:58:34Z

An alternative and likely better fix would to just remove or update the recursive path from the Create calls. Such a path is only used for indirect invocation (reflection, delegates, function pointers, etc) and its not really worth optimizing them when the software fallback already does the right thing. Additionally, it simplifies the handling and reduces risk of other errors due to future ISA or implementation changes.

@tannergooding I am not sure I understand how it's better - if we remove the recursive pass we won't be able to optimize Vector256.Create on AVX1 so e.g. Avx.Xor(Vector256.Create(...), Vector256.Create(...)) will be terribly slow.

you probably meant that we can drop the recursive path and keep my JIT change? Because if I revert my change and remove the recursive path it will be slow

tannergooding · 2022-07-20T17:14:32Z

@tannergooding I am not sure I understand how it's better - if we remove the recursive pass we won't be able to optimize Vector256.Create on AVX1 so e.g. Avx.Xor(Vector256.Create(...), Vector256.Create(...)) will be terribly slow.

AVX only hardware is pretty rare (only Sandy Bridge and Ivy Bridge processors) and performance is already going to be problematic in various cases due to other APIs and scenarios not being accelerated.

We also continue hitting various edge cases and hard to find bugs due to the complex relationship between what's available on AVX vs what requires AVX2. If we want to provide AVX only acceleration then that should likely be looked at more broadly and we should consider the best way to model that to ensure that all the "expected" functions work as users require. It's notably too late in the .NET 7 cycle to do that, however, and is decently non-trivial since it requires looking at each Vector256.* ISA and considering when it can light up for each of the supporting types (often float/double being AVX and integer types being AVX2; but there is some intermixing there).

Keeping this so that Vector256.Create is only accelerated when Vector256.IsHardwareAccelerated returns true simplifies the overall handling and ensures that .NET 7 ships more robustly. This will not impact cloud providers (who, to my knowledge, are all on Broadwell or later) and won't impact most end users.

tannergooding · 2022-07-20T17:16:10Z

Taking this change "as is" requires running the outerloop ISA tests and validating that no AVX2 instructions are generated by any path that Vector256.Create can hit, including under GT_CNS_VEC and any other tree shapes that morph, valuenum, lowering, or other phases may introduce.

EgorBo · 2022-07-20T17:16:24Z

@tannergooding I think it's better to do both drop the recursive calls (I've just done it) and still accelerate Vector256.Create for AVX1 (also done).

tannergooding · 2022-07-20T17:17:52Z

and still accelerate Vector256.Create for AVX1 (also done).

This is still going to require running the outerloop ISA stress jobs and confirming that other phases aren't expecting to only be hit when AVX2 is enabled.

EgorBo · 2022-07-20T17:18:02Z

Because it's possible to work with AVX-only in .NET 6.0 via Vector256.Create + Avx.X APIs - I don't see why we want to break it

EgorBo · 2022-07-20T17:20:31Z

/azp list

azure-pipelines · 2022-07-20T17:20:36Z

CI/CD Pipelines for this repository: runtime-coreclr outerloop runtime-coreclr jitstress runtime-coreclr jitstressregs runtime-coreclr jitstress2-jitstressregs runtime-coreclr gcstress0x3-gcstress0xc runtime-coreclr gcstress-extra runtime-coreclr r2r-extra runtime-coreclr jitstress-isas-x86 runtime-coreclr jitstress-isas-arm runtime-coreclr jitstressregs-x86 runtime-coreclr libraries-jitstressregs runtime-coreclr libraries-jitstress2-jitstressregs runtime-coreclr r2r runtime-coreclr runincontext runtime-coreclr crossgen2 runtime-libraries-coreclr outerloop runtime-libraries-coreclr outerloop-windows runtime-libraries-coreclr outerloop-linux runtime-libraries-coreclr outerloop-osx runtime runtime-libraries enterprise-linux runtime-libraries stress-http runtime-libraries stress-ssl runtime-dev-innerloop runtime-coreclr crossgen2 outerloop coreclr-release-outerloop-nightly sync-runtime-to-mono runtime-coreclr crossgen2-composite runtime-jit-experimental runtime-coreclr libraries-jitstress dotnet-linker-tests runtime-coreclr ilasm runtime-coreclr crossgen2-composite gcstress runtime-libraries-mono outerloop runtime-staging runtime-coreclr pgo runtime-coreclr libraries-pgo coreclr-gc-regions Antigen runtime-community Fuzzlyn runtime-coreclr superpmi-replay runtime-wasm runtime-coreclr superpmi-diffs runtime-coreclr superpmi-asmdiffs-checked-release runtime-manual runtime-extra-platforms jit-cfg perf-wasm runtime-llvm

EgorBo · 2022-07-20T17:21:46Z

and still accelerate Vector256.Create for AVX1 (also done).

This is still going to require running the outerloop ISA stress jobs and confirming that other phases aren't expecting to only be hit when AVX2 is enabled.

do you mean runtime outerloop pipeline?

tannergooding · 2022-07-20T17:24:55Z

Because it's possible to work with AVX-only in .NET 6.0 via Vector256.Create + Avx.X APIs - I don't see why we want to break it

It's semi-possible and you run into some poor experiences and bad codegen because the JIT can't optimize everything as it needs. We also hit and had to patch several bugs related to mismatches.

Generating AVX only code (when AVX2 isn't supported) has been one of the biggest issues we've faced because the Vector<T> never had such support and various paths don't expect it. The raw hardware intrinsics mostly mitigate this in that they are directly tied to an ISA/base type. The helper APIs (like Create or GetElement) are not directly tied to an ISA or base type, they are generic and you have to keep in mind in the JIT exactly how each tree shape is handled in the backend.

tannergooding · 2022-07-20T17:25:18Z

/azp run runtime-coreclr jitstress-isas-x86

^ this one (edit: thought this would work since the run command was on a single line, apparently it doesn't >.>)

azure-pipelines · 2022-07-20T17:25:23Z

No pipelines are associated with this pull request.

EgorBo · 2022-07-20T17:25:38Z

/azp run runtime-coreclr jitstress-isas-x86

azure-pipelines · 2022-07-20T17:25:47Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2022-07-20T17:26:25Z

No pipelines are associated with this pull request.

EgorBo · 2022-07-20T17:29:41Z

@tannergooding sure, I understand your point, but as we now have stronger AOT-only story (NativeAOT) we should be ready for cases when developers limit capabilities to some lower common denominator e.g. --instruction-set avx (even some modern games distribute binaries for avx1 only afaik). So while for JIT ivy-bridge like machines might be rare - it's not the case for AOT case like ^.

tannergooding · 2022-07-20T17:30:41Z

Sure, the main concern here is destabilizing .NET 7 and the work that might be required to ensure that everything end to end works as expected particularly with all the changes and issues that had already cropped up and caused the limitation to be put in for .NET 7

EgorBo · 2022-07-21T13:12:47Z

@tannergooding CI looks good, pri1 isas x86 pipelines are broken #71928

saucecontrol · 2022-07-22T04:59:19Z

Keeping this so that Vector256.Create is only accelerated when Vector256.IsHardwareAccelerated returns true simplifies the overall handling and ensures that .NET 7 ships more robustly. This will not impact cloud providers (who, to my knowledge, are all on Broadwell or later) and won't impact most end users.

This may not impact Azure or AWS, but some of the smaller hosts definitely have these older machines in the mix. I have a client running .NET apps on a large number of Rackspace Cloud VMs, and all the ones I saw in a recent spot check were Sandy Bridge (Xeon E5-2680 to be precise).

Vector256.Create(float) has been accelerated since 3.0 on AVX-only, and no matter how uncommon that hardware is, disabling acceleration now would be a nasty surprise for anyone that's running currently-fast AVX code paths that would be crippled on update to 7.0.

EgorBo · 2022-07-25T14:28:42Z

@tannergooding ping

tannergooding · 2022-07-25T16:37:52Z

Vector256.Create(float) has been accelerated since 3.0 on AVX-only, and no matter how uncommon that hardware is, disabling acceleration now would be a nasty surprise for anyone that's running currently-fast AVX code paths that would be crippled on update to 7.0.

More generally speaking, I'm raising a concern around the number of bugs we've had and continue hitting on trying to split AVX/AVX2 functionality.

There is complexity in ensuring that support exists, that it works well, and that some code path doesn't accidentally pessimize it because it really needs some functionality that was only in AVX2 instead (not to mention that there isn't a bug, of which we've had a few, where AVX2 is generated when only AVX is supported). This is why when AMD, Intel, LLVM, and others got together to define the x86-x64-v* baselines, they were:

v1: CMOV, CMPXCHG8B, SSE, SSE2
v2: ..., CMPXCHG16B, POPCNT, SSE3, SSSE3, SSE4.1, SSE4.2
v3: ..., AVX, AVX2, BMI1, BMI2, FMA, F16C, LZCNT, MOVBE
v4: ..., AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL

In order to keep things "simple" and avoid explosions around support matrices. Supporting just "SSSE3" or just "AVX" doesn't necessarily make sense and so Vector256.IsHardwareAccelerated reports "false" when only AVX is supported. And while I agree that not regressing an existing scenario of Vector256.Create(float) working on AVX only is goodness, there is a balance in ensuring that the overall support in the JIT remains robust/correct and that we don't unnecessarily complicate the JIT implementation.

Some examples of where this gets complex, even for Create(float) are that VBROADCASTSS ymm, xmm is AVX2; AVX only exposes VBROADCASTSS ymm, mem32. The same goes for VPERMPS. You then get into more complex cases with integer types where Vector256<int> ends up partially accelerated because GetLower/GetUpper and WithLower/WithUpper can be "always accelerated" as can Load, Store, and Test. But other operations can't be. So it becomes a question of "where do we draw the line" and how complex do we make the import checks around whether something can or cannot be intrinsified in importation.

EgorBo · 2022-07-25T17:38:52Z

So it becomes a question of "where do we draw the line" and how complex do we make the import checks around whether something can or cannot be intrinsified in importation.

@tannergooding I agree but It still reports Vector256.IsHardwareAccelerated as false and significantly improves perf for the most common operations Vector256.Create and fixes the unexpected recursion = PlatformNotSupported. Other less common portable APIs can be accelerated as well as we hit them, but they not that common as Create IMO

I do think we need to fix #72506 for 7.0 - there are no good reasons to break what worked in 3.1-6.0

tannergooding · 2022-07-25T18:12:33Z

I do think we need to fix #72506 for 7.0

Yes, 100% agree

there are no good reasons to break what worked in 3.1-6.0

Provided we have sufficient coverage and validation that it is fixed and isn't regressing other key scenarios, and isn't going to be something we're likely to decide to regress in the future due to the complexity required in the JIT, etc.

EgorBo · 2022-07-25T18:19:11Z

@tannergooding so can you approve it?
the jit change is literally few lines of code, CI is happy
the rest is just a clean up to remove the hack that is only improving perf for indirect usages (which are slow by definition) you asked me to do

Enable NI_Vector256_Create on AVX1

22a488d

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 20, 2022

ghost assigned EgorBo Jul 20, 2022

Add test

f6f2cbf

EgorBo marked this pull request as ready for review July 20, 2022 13:41

teo-tsirpanis reviewed Jul 20, 2022

View reviewed changes

src/coreclr/jit/hwintrinsic.cpp Outdated Show resolved Hide resolved

Fix build & fix typo

c17297c

Remove recursive calls from Vector_.Create

cce7e2b

runfoapp bot mentioned this pull request Jul 21, 2022

System.IO.Pipes.Tests test doesnt throw #72587

Closed

tannergooding approved these changes Jul 25, 2022

View reviewed changes

EgorBo merged commit fb1d318 into dotnet:main Jul 25, 2022

tarekgh mentioned this pull request Jul 27, 2022

Test failure System.Tests.TimeZoneInfoTests.NJulianRuleTest(posixRule: \"<+00>0<+01>,A/0,J365/25\", dayNumber: 0, monthNumber: 0, shouldSucceed: False) #72825

Closed

tannergooding mentioned this pull request Aug 2, 2022

Fix issue when writing lossy webp with alpha data SixLabors/ImageSharp#2173

Merged

4 tasks

EgorBo deleted the fix-v256-avx2 branch August 6, 2022 22:42

EgorBo mentioned this pull request Aug 6, 2022

Failing tests on .NET7 SixLabors/ImageSharp#2117

Closed

4 tasks

tannergooding mentioned this pull request Aug 10, 2022

Ensure partial Vector256 acceleration is still possible on AVX only hardware #73720

Merged

ghost locked as resolved and limited conversation to collaborators Sep 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable NI_Vector256_Create on AVX1 #72522

Enable NI_Vector256_Create on AVX1 #72522

EgorBo commented Jul 20, 2022

ghost commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

EgorBo commented Jul 20, 2022 •

edited

Loading

tannergooding commented Jul 20, 2022

tannergooding commented Jul 20, 2022 •

edited

Loading

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

EgorBo commented Jul 20, 2022

EgorBo commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

tannergooding commented Jul 20, 2022 •

edited

Loading

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022 •

edited

Loading

EgorBo commented Jul 21, 2022

saucecontrol commented Jul 22, 2022

EgorBo commented Jul 25, 2022

tannergooding commented Jul 25, 2022

EgorBo commented Jul 25, 2022 •

edited

Loading

tannergooding commented Jul 25, 2022

EgorBo commented Jul 25, 2022 •

edited

Loading

Enable NI_Vector256_Create on AVX1 #72522

Enable NI_Vector256_Create on AVX1 #72522

Conversation

EgorBo commented Jul 20, 2022

ghost commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

EgorBo commented Jul 20, 2022 • edited Loading

tannergooding commented Jul 20, 2022

tannergooding commented Jul 20, 2022 • edited Loading

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

EgorBo commented Jul 20, 2022

EgorBo commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022

tannergooding commented Jul 20, 2022 • edited Loading

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

azure-pipelines bot commented Jul 20, 2022

EgorBo commented Jul 20, 2022

tannergooding commented Jul 20, 2022 • edited Loading

EgorBo commented Jul 21, 2022

saucecontrol commented Jul 22, 2022

EgorBo commented Jul 25, 2022

tannergooding commented Jul 25, 2022

EgorBo commented Jul 25, 2022 • edited Loading

tannergooding commented Jul 25, 2022

EgorBo commented Jul 25, 2022 • edited Loading

EgorBo commented Jul 20, 2022 •

edited

Loading

tannergooding commented Jul 20, 2022 •

edited

Loading

tannergooding commented Jul 20, 2022 •

edited

Loading

tannergooding commented Jul 20, 2022 •

edited

Loading

EgorBo commented Jul 25, 2022 •

edited

Loading

EgorBo commented Jul 25, 2022 •

edited

Loading