Optimize FMA codegen base on the overwritten #58196

Merged (40 commits) on Dec 1, 2021

weilinwa
Contributor

This is for #12984. @kunalspathak @tannergooding, thanks!

@ghost added the community-contribution label on Aug 26, 2021
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label and removed the community-contribution label on Aug 26, 2021
@ghost

ghost commented Aug 26, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

@JulieLeeMSFT added this to the 7.0.0 milestone on Aug 26, 2021
Contributor

@SingleAccretion left a comment

Some questions and suggestions.

Resolved review threads:
src/coreclr/jit/gentree.cpp (7 threads, 6 outdated)
Member

@kunalspathak left a comment

Added some comments. Did you run the superpmi asmdiff?

Resolved review threads:
src/coreclr/jit/gentree.cpp (2 threads, 1 outdated)
src/coreclr/jit/hwintrinsiccodegenxarch.cpp (2 threads, both outdated)
src/coreclr/jit/lsraxarch.cpp (1 thread, outdated)
@weilinwa
Contributor Author

Added some comments. Did you run the superpmi asmdiff?

No, I haven't. What is this for?

@SingleAccretion
Contributor

What is this for?

Pretty much all Jit changes are run through diffs (and SPMI is probably the most convenient tool for getting them), so that we can assess the impact on the generated code and how much existing test coverage we have.

@weilinwa
Contributor Author

@SingleAccretion, I ran superpmi.py with asmdiffs and saw quite a few errors, most of which come from "JIT.HardwareIntrinsics.Arm.Helpers:FPRSqrtStepFused(float,float):float" or other similar tests. How can I find these tests to debug them? And it's very confusing: my change is supposed to only work for xarch, so why are the Arm tests complaining?

@SingleAccretion
Contributor

How can I find these tests to debug?

@weilinwa One of the nicest things about SPMI is that it makes debugging easy. When you encounter errors (I presume asserts), the tool should have printed a "reproduction command" with the path to the native SPMI executable and a list of .mcs. From there it should be straightforward to use any native debugger (I personally use VS's "executable project" feature) to drill into the code. I recommend using the Debug builds of native SPMI and the Jit for this; the script uses Checked by default.

And it's very confusing: my change is supposed to only work for xarch, so why are the Arm tests complaining?

I am not sure why that is either.

@weilinwa
Contributor Author

@SingleAccretion, I got "Error: no baseline JIT found" when running asmdiffs with -build_type Release or -build_type Debug. Only the Checked worked for me. Are the options I used correct?

@SingleAccretion
Contributor

Only the Checked worked for me. Are the options I used correct?

Yes. I believe we only have prebuilt Jits for the Checked config. That said, you can of course supply your own Jit for the base (or diff) via the -base_jit_path and -diff_jit_path options.

@kunalspathak
Member

@SingleAccretion, I got "Error: no baseline JIT found" when running asmdiffs with -build_type Release or -build_type Debug. Only the Checked worked for me. Are the options I used correct?

The correct way to use this is:

python superpmi.py asmdiffs -f benchmarks -base_jit_path path\to\before\clrjit_win_x64_x64.dll -diff_jit_path path\to\after\clrjit_win_x64_x64.dll -target_os windows -target_arch x64

python superpmi.py asmdiffs -f benchmarks -base_jit_path path\to\before\clrjit_unix_x64_x64.dll -diff_jit_path path\to\after\clrjit_unix_x64_x64.dll -target_os Linux -target_arch x64

This runs asmdiffs for the benchmarks collection. You might also want to try libraries.pmi (.NET Core libraries methods), coreclr_tests (test cases) and asp (ASP.NET benchmarks).
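
To repeat the run over the other collections mentioned above, the command lines can be generated rather than typed by hand. The following is a minimal sketch, not part of the superpmi tooling: the helper names are hypothetical, the JIT paths are the placeholders from the commands above, and only flags that appear in those commands are used.

```python
# Hypothetical placeholder paths; substitute your own baseline and diff JIT builds.
BASE_JIT = r"path\to\before\clrjit_win_x64_x64.dll"
DIFF_JIT = r"path\to\after\clrjit_win_x64_x64.dll"

# Collection names as listed in the comment above.
COLLECTIONS = ["benchmarks", "libraries.pmi", "coreclr_tests", "asp"]


def asmdiffs_command(collection, base_jit, diff_jit,
                     target_os="windows", target_arch="x64"):
    """Build the argv list for one asmdiffs run, using only the flags shown above."""
    return [
        "python", "superpmi.py", "asmdiffs",
        "-f", collection,
        "-base_jit_path", base_jit,
        "-diff_jit_path", diff_jit,
        "-target_os", target_os,
        "-target_arch", target_arch,
    ]


def all_commands():
    """One asmdiffs command per collection, all against the same pair of JITs."""
    return [asmdiffs_command(c, BASE_JIT, DIFF_JIT) for c in COLLECTIONS]


if __name__ == "__main__":
    for argv in all_commands():
        # Printed rather than executed; pass argv to subprocess.run to run it.
        print(" ".join(argv))
```

Printing the commands first makes it easy to check the paths before starting a long run.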

@weilinwa
Contributor Author

weilinwa commented Sep 9, 2021

@kunalspathak @tannergooding, I've modified the code logic to check the different IsContainableHWIntrinsicOp() possibilities for each case of overwrittenOpNum. Please take a look.
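
For readers unfamiliar with the problem, here is a rough sketch of the kind of decision being described. It is not the PR's code: `select_fma_form` and `evaluate` are hypothetical names, and the model assumes only two facts about the x86 FMA instructions, namely that the destination register is overwritten and that only the last source operand may come from memory. Operands are written DEST, SRC1, SRC2, and the digits in a form name refer to them as 1, 2 and 3.

```python
def select_fma_form(overwritten, contained=None):
    """Pick an FMA form for op1 * op2 + op3 (illustrative model only).

    overwritten: operand number (1..3) whose register becomes DEST.
    contained:   operand number to place in the memory-capable SRC2 slot, or None.
    Returns (form, dest, src1, src2) as operand numbers.
    """
    assert overwritten in (1, 2, 3)
    assert contained in (None, 1, 2, 3) and contained != overwritten
    if overwritten == 3:
        # The addend is the destination: 231 computes SRC1 * SRC2 + DEST.
        src2 = contained if contained is not None else 2
        src1 = 1 if src2 == 2 else 2
        return ("231", 3, src1, src2)
    other = 2 if overwritten == 1 else 1  # the other multiplicand
    if contained == other:
        # 132 computes DEST * SRC2 + SRC1, so the other multiplicand goes last.
        return ("132", overwritten, 3, other)
    # 213 computes SRC1 * DEST + SRC2, so the addend goes last.
    return ("213", overwritten, other, 3)


def evaluate(choice, values):
    """Evaluate a chosen form on scalar values keyed by operand number."""
    form, dest, src1, src2 = choice
    d, a, b = values[dest], values[src1], values[src2]
    return {"132": d * b + a, "213": a * d + b, "231": a * b + d}[form]
```

With `values = {1: 2.0, 2: 3.0, 3: 4.0}`, every valid combination of `overwritten` and `contained` should evaluate to `op1 * op2 + op3`, which is a cheap way to check that a mapping has not swapped a multiplicand with the addend.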

Asm diffs

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1071
Total bytes of diff: 1083
Total bytes of delta: 12 (1.12% of base)
Total relative delta: 0.08
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
          12 : 12262.dasm (7.55% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 2 unchanged.

Top method regressions (bytes):
          12 ( 7.55% of base) : 12262.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4

Top method regressions (percentages):
          12 ( 7.55% of base) : 12262.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4

1 total methods with Code Size differences (0 improved, 1 regressed), 2 unchanged.


coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 62236
Total bytes of diff: 62336
Total bytes of delta: 100 (0.16% of base)
Total relative delta: 8.10
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
           4 : 118516.dasm (23.53% of base)
           4 : 126128.dasm (23.53% of base)
           4 : 170806.dasm (23.53% of base)
           4 : 129338.dasm (23.53% of base)
           4 : 219361.dasm (5.56% of base)
           4 : 137701.dasm (23.53% of base)
           4 : 6028.dasm (23.53% of base)
           4 : 219357.dasm (4.55% of base)
           4 : 219358.dasm (5.56% of base)
           4 : 6042.dasm (23.53% of base)
           4 : 134051.dasm (23.53% of base)
           4 : 131147.dasm (23.53% of base)
           4 : 134065.dasm (23.53% of base)
           4 : 219351.dasm (5.56% of base)
           4 : 219354.dasm (6.25% of base)
           4 : 112774.dasm (23.53% of base)
           4 : 219359.dasm (5.00% of base)
           4 : 219360.dasm (5.00% of base)
           4 : 117293.dasm (23.53% of base)
           4 : 43453.dasm (23.53% of base)

Top file improvements (bytes):
         -19 : 219345.dasm (-6.71% of base)
         -19 : 219328.dasm (-6.86% of base)
         -17 : 219347.dasm (-4.51% of base)
         -11 : 219330.dasm (-2.99% of base)
          -4 : 84103.dasm (-10.81% of base)
          -1 : 141012.dasm (-0.16% of base)
          -1 : 140956.dasm (-0.16% of base)
          -1 : 141076.dasm (-0.16% of base)
          -1 : 141416.dasm (-0.16% of base)
          -1 : 141440.dasm (-0.16% of base)
          -1 : 141060.dasm (-0.16% of base)
          -1 : 141400.dasm (-0.16% of base)
          -1 : 141432.dasm (-0.16% of base)
          -1 : 141464.dasm (-0.16% of base)
          -1 : 141448.dasm (-0.16% of base)
          -1 : 219333.dasm (-0.61% of base)
          -1 : 141424.dasm (-0.16% of base)
          -1 : 141052.dasm (-0.16% of base)
          -1 : 141408.dasm (-0.16% of base)
          -1 : 140980.dasm (-0.16% of base)

85 total files with Code Size differences (35 improved, 50 regressed), 334 unchanged.

Top method regressions (bytes):
           4 (23.53% of base) : 170806.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 129338.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 137701.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 6042.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 131147.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 134065.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 43453.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 124425.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 111033.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 120255.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 135993.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118530.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 126142.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 171556.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 117307.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 112788.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118516.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 126128.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 6028.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 134051.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float

Top method improvements (bytes):
         -19 (-6.71% of base) : 219345.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,double)
         -19 (-6.86% of base) : 219328.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,float)
         -17 (-4.51% of base) : 219347.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,double)
         -11 (-2.99% of base) : 219330.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,float)
          -4 (-10.81% of base) : 84103.dasm - Runtime_39424:TestLclFldAddrIntrinsicsFMA_MulipluAddScalar():double
          -1 (-0.65% of base) : 219313.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,double)
          -1 (-0.62% of base) : 219331.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,float)
          -1 (-0.52% of base) : 219314.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,double)
          -1 (-0.49% of base) : 219332.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,float)
          -1 (-0.61% of base) : 219315.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,double)
          -1 (-0.61% of base) : 219333.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,float)
          -1 (-0.16% of base) : 141408.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141004.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141012.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141416.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141440.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141052.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141060.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 141448.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 140956.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddDouble):this

Top method regressions (percentages):
           4 (23.53% of base) : 170806.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 129338.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 137701.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 6042.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 131147.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 134065.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 43453.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 124425.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 111033.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 120255.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 135993.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118530.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 126142.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 171556.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 117307.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 112788.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118516.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 126128.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 6028.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 134051.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float

Top method improvements (percentages):
          -4 (-10.81% of base) : 84103.dasm - Runtime_39424:TestLclFldAddrIntrinsicsFMA_MulipluAddScalar():double
         -19 (-6.86% of base) : 219328.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,float)
         -19 (-6.71% of base) : 219345.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,double)
         -17 (-4.51% of base) : 219347.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,double)
         -11 (-2.99% of base) : 219330.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,float)
          -1 (-0.65% of base) : 219313.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,double)
          -1 (-0.62% of base) : 219331.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,float)
          -1 (-0.61% of base) : 219315.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,double)
          -1 (-0.61% of base) : 219333.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,float)
          -1 (-0.52% of base) : 219314.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,double)
          -1 (-0.49% of base) : 219332.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,float)
          -1 (-0.16% of base) : 141004.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141012.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141052.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141060.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 140956.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddDouble):this
          -1 (-0.16% of base) : 140972.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddNegatedDouble):this
          -1 (-0.16% of base) : 140980.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddNegatedSingle):this
          -1 (-0.16% of base) : 140964.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddSingle):this
          -1 (-0.16% of base) : 141036.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplySubtractDouble):this

85 total methods with Code Size differences (35 improved, 50 regressed), 334 unchanged.


libraries.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 836
Total bytes of diff: 860
Total bytes of delta: 24 (2.87% of base)
Total relative delta: 0.91
    diff is a regression.
    relative diff is a regression.
Detail diffs


Top file regressions (bytes):
           1 : 18759.dasm (3.57% of base)
           1 : 18779.dasm (3.57% of base)
           1 : 18765.dasm (3.57% of base)
           1 : 18762.dasm (4.00% of base)
           1 : 18756.dasm (4.00% of base)
           1 : 18763.dasm (4.00% of base)
           1 : 18758.dasm (3.57% of base)
           1 : 18782.dasm (4.00% of base)
           1 : 18768.dasm (3.57% of base)
           1 : 18767.dasm (4.00% of base)
           1 : 18772.dasm (4.00% of base)
           1 : 18783.dasm (4.00% of base)
           1 : 18773.dasm (4.00% of base)
           1 : 18785.dasm (3.57% of base)
           1 : 18764.dasm (3.57% of base)
           1 : 18774.dasm (3.57% of base)
           1 : 18775.dasm (3.57% of base)
           1 : 18757.dasm (4.00% of base)
           1 : 18769.dasm (3.57% of base)
           1 : 18776.dasm (4.00% of base)

24 total files with Code Size differences (0 improved, 24 regressed), 8 unchanged.

Top method regressions (bytes):
           1 ( 4.00% of base) : 18757.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18756.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18759.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18758.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18777.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18776.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18779.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18778.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18763.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18762.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18765.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18764.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18767.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18766.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18769.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18768.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18773.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18772.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18775.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18774.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]

Top method regressions (percentages):
           1 ( 4.00% of base) : 18757.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18756.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18777.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18776.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18763.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18762.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18767.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18766.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18773.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18772.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18783.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18782.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18759.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18758.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18779.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18778.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18765.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18764.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18769.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18768.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]

24 total methods with Code Size differences (0 improved, 24 regressed), 8 unchanged.


libraries_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 50
Total bytes of diff: 42
Total bytes of delta: -8 (-16.00% of base)
Total relative delta: -0.32
    diff is an improvement.
    relative diff is an improvement.
Detail diffs


Top file improvements (bytes):
          -4 : 210044.dasm (-16.00% of base)
          -4 : 209492.dasm (-16.00% of base)

2 total files with Code Size differences (2 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -4 (-16.00% of base) : 209492.dasm - System.Tests.MathFTests:FusedMultiplyAdd(float,float,float,float)
          -4 (-16.00% of base) : 210044.dasm - System.Tests.MathTests:FusedMultiplyAdd(double,double,double,double)

Top method improvements (percentages):
          -4 (-16.00% of base) : 209492.dasm - System.Tests.MathFTests:FusedMultiplyAdd(float,float,float,float)
          -4 (-16.00% of base) : 210044.dasm - System.Tests.MathTests:FusedMultiplyAdd(double,double,double,double)

2 total methods with Code Size differences (2 improved, 0 regressed), 0 unchanged.


@weilinwa
Contributor Author

@tannergooding, I have a question about Fma.MultiplyAddScalar and other scalar type FMA methods.

In scalar FMA instructions such as VFMADD132SS DEST, SRC1, SRC2, DEST holds the scalar result in DEST[31:0] and DEST[127:32] is unchanged. However, because of the 3 different FMA forms, DEST could be mapped to any one of the three operands in Fma.MultiplyAddScalar(op1, op2, op3).

My question is, do we need to ensure op1[127:32] == result[127:32] (rather than op2[127:32] == result[127:32] or op3[127:32] == result[127:32]) in the definition of Fma.MultiplyAddScalar? If we do, does this mean we cannot choose among the 3 FMA forms freely? For 132 and 213, we could ensure op1 is mapped to DEST because multiplication is commutative. But for 231, DEST needs to be mapped to op3.

@tannergooding
Member

tannergooding commented Sep 16, 2021

@tannergooding, I have a question about Fma.MultiplyAddScalar and other scalar type FMA methods.

In scalar FMA instructions such as VFMADD132SS DEST, SRC1, SRC2, DEST holds the scalar result in DEST[31:0] and DEST[127:32] is unchanged. However, because of the 3 different FMA forms, DEST could be mapped to any one of the three operands in Fma.MultiplyAddScalar(op1, op2, op3).

My question is, do we need to ensure op1[127:32] == result[127:32] (rather than op2[127:32] == result[127:32] or op3[127:32] == result[127:32]) in the definition of Fma.MultiplyAddScalar? If we do, does this mean we cannot choose among the 3 FMA forms freely? For 132 and 213, we could ensure op1 is mapped to DEST because multiplication is commutative. But for 231, DEST needs to be mapped to op3.

@weilinwa, that's a great question. The TL;DR; is that yes we do need to ensure op1[127:32] == result[127:32] or more specifically that the upper result bits come from the a operand (this can be done via a pre or post move/merge if appropriate/required).

Normally we provide two versions of the scalar function where this matters, such as:

public static Vector128<float> ReciprocalScalar(Vector128<float> value);
public static Vector128<float> ReciprocalScalar(Vector128<float> upper, Vector128<float> value);

When we do this, the upper bits come from value for the first overload and from upper in the other. We do this to try and ensure determinism first and foremost.

For FMA, we only expose overloads like the first one and so the expectation is that the upper bits come from a. Today, we ensure that a (op1) can't be contained for the scalar variants so that it is always the destination (see the check for CopiesUpperBits): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6312-L6347

We'd need to expose MultiplyAddScalarUnsafe APIs, or something similar, to allow the upper bits to be "undefined" (that is, to come from any operand) and so allow the most efficient codegen in all scenarios. That would require an API review and approval for the scenario (but is likely worth it, since it would also benefit Math.FusedMultiplyAdd, where the upper bits aren't exposed and don't matter).
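To make the upper-bits constraint concrete, here is a minimal software model of the three scalar forms. It is only a sketch: `Xmm`, the `vfmadd*ss` functions, and `sameUpper` are hypothetical names for illustration, not JIT or hardware APIs, and each register is modeled as four float lanes with lane 0 standing for bits [31:0].

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Four 32-bit lanes of an XMM register; lane 0 models bits [31:0].
using Xmm = std::array<float, 4>;

// Scalar FMA forms: only lane 0 of dest is written, lanes 1..3 keep dest's old value.

// 132 form: dest = (dest * src3) + src2
Xmm vfmadd132ss(Xmm dest, const Xmm& src2, const Xmm& src3)
{
    dest[0] = std::fma(dest[0], src3[0], src2[0]);
    return dest;
}

// 213 form: dest = (src2 * dest) + src3
Xmm vfmadd213ss(Xmm dest, const Xmm& src2, const Xmm& src3)
{
    dest[0] = std::fma(src2[0], dest[0], src3[0]);
    return dest;
}

// 231 form: dest = (src2 * src3) + dest
Xmm vfmadd231ss(Xmm dest, const Xmm& src2, const Xmm& src3)
{
    dest[0] = std::fma(src2[0], src3[0], dest[0]);
    return dest;
}

// Hypothetical helper: true when lanes 1..3 of x and y match.
bool sameUpper(const Xmm& x, const Xmm& y)
{
    return x[1] == y[1] && x[2] == y[2] && x[3] == y[3];
}
```

For result = (a * b) + c, `vfmadd132ss(a, c, b)` and `vfmadd213ss(a, b, c)` both leave a's upper lanes in the result, because a is a multiplicand and can sit in DEST. `vfmadd231ss(c, a, b)` produces the same lane 0 but carries c's upper lanes, which is why the 231 form would need the extra move/merge mentioned above whenever the upper bits must come from a.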


srcCount += 1;
srcCount += BuildDelayFreeUses(emitOp2, emitOp1);
srcCount += emitOp3->isContained() ? BuildOperandUses(emitOp3) : BuildDelayFreeUses(emitOp3, emitOp1);
Member

This is a lot smaller and easier to follow now 🎉

Comment on lines 2365 to 2366
if (containedOpNum == 1 && !copiesUpperBits)
{
Member

Is the !copiesUpperBits check needed? If copiesUpperBits is true, then containedOpNum shouldn't be 1.

Contributor Author

Yes, this should be true. Do we need to add an assert before to ensure that?

Member

Contributor Author

maybe replace it with assert(containedOpNum != 1 || !copiesUpperBits); to also cover the regOptional case

// Intrinsics with CopyUpperBits semantics must have op1 as target
if (containedOpNum == 1 && !copiesUpperBits)
{
if (resultOpNum != 3)
Member

It would be nice if this were the positive case and there were an assert that resultOpNum != containedOpNum

Therefore, if containedOpNum == 1 then resultOpNum can only be 0, 2, or 3

If it's 3, then swapping op1/op3 is sufficient
If it's 2, then swapping op2/op3 is needed first
If it's 0, then it doesn't matter what we do, so it's fine to not swap
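The case analysis above can be sketched as a small planner that places the three operands into instruction slots and then picks the form from where the addend lands. This is an illustrative model only: `FmaPlan`, `chooseFmaPlan`, and `evaluate` are hypothetical names, not the JIT's functions, and the sketch ignores copiesUpperBits entirely.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical model, not the JIT's data structures.
// slots[0] is the destination register, slots[1] the second register source,
// slots[2] the register-or-memory source. Each entry is an operand number 1..3
// for result = (op1 * op2) + op3.
struct FmaPlan
{
    int                form;  // 132, 213 or 231
    std::array<int, 3> slots; // operand number placed in each instruction slot
};

// containedOpNum: operand that must be the memory source (0 = none).
// resultOpNum: operand whose register may be overwritten (0 = none).
FmaPlan chooseFmaPlan(int containedOpNum, int resultOpNum)
{
    assert(containedOpNum >= 0 && containedOpNum <= 3);
    assert(resultOpNum >= 0 && resultOpNum <= 3);

    // A contained operand cannot also be the destination, so drop the hint then.
    if (resultOpNum == containedOpNum)
    {
        resultOpNum = 0;
    }

    std::array<int, 3> slots{resultOpNum, 0, containedOpNum};

    // Fill each open slot with the lowest-numbered operand not yet placed.
    for (int& slot : slots)
    {
        if (slot != 0)
        {
            continue;
        }
        for (int op = 1; op <= 3; ++op)
        {
            if (op != slots[0] && op != slots[1] && op != slots[2])
            {
                slot = op;
                break;
            }
        }
    }

    // The slot that holds the addend (op3) selects the form.
    const int form = (slots[0] == 3) ? 231 : (slots[1] == 3) ? 132 : 213;
    return {form, slots};
}

// Evaluates a plan on scalar inputs using the semantics of each form.
double evaluate(const FmaPlan& plan, double op1, double op2, double op3)
{
    const double v[4] = {0.0, op1, op2, op3};
    const double dest = v[plan.slots[0]];
    const double src2 = v[plan.slots[1]];
    const double src3 = v[plan.slots[2]];
    switch (plan.form)
    {
        case 132: return std::fma(dest, src3, src2); // dest = (dest * src3) + src2
        case 213: return std::fma(src2, dest, src3); // dest = (src2 * dest) + src3
        default:  return std::fma(src2, src3, dest); // 231: dest = (src2 * src3) + dest
    }
}
```

With containedOpNum == 1, a resultOpNum of 3 yields the slot order (op3, op2, op1), which is the single op1/op3 swap, and a resultOpNum of 2 yields (op2, op3, op1), which is the op2/op3 swap followed by the op1/op3 swap. In every combination the planned instruction still computes (op1 * op2) + op3.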

Contributor Author

Is it possible that containedOpNum == 0 and resultOpNum == 0?

Member

If none of the operands are overwritten and none are last use, then containedOpNum == 0.

I think we probably also won't get containedOpNum == 0, because VFMADD should support general-purpose loads as well, so RegOptional should probably be true for at least one case. But in general it's better to check and account for possible future changes, scenarios, or nodes that are introduced.

Contributor Author

Because an operand's lastUse could be updated after lowering, there are cases where resultOpNum == containedOpNum even though both are nonzero.

Contributor Author

Can we make multiple ops contained in lowering or change that in lsra?

}
else
{
assert(containedOpNum == 2);
Member

I think it's possible for containedOpNum to be 0 and so we should check for this explicitly.


srcCount += op3->isContained() ? BuildOperandUses(op3) : BuildDelayFreeUses(op3, op1);
if (resultOpNum == 3 && !copiesUpperBits)
Member

@tannergooding tannergooding Nov 17, 2021

Just capturing a comment, I don't think we need to do anything in this PR.

I think the logic around copiesUpperBits could be simplified a bit so we don't need these extra checks everywhere. That is, if copiesUpperBits is true, then resultOpNum doesn't matter if it's not 1, so maybe we should be forcing resultOpNum to be 0 in that case (that is, if copiesUpperBits == true and resultOpNum != 1, then treat it as 0, because no matter what we do, op1 cannot be swapped or moved about and op2/op3 will be delay free or contained).
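The suggested simplification could look roughly like the helper below. This is a sketch under assumptions: `effectiveResultOpNum` is a hypothetical name that does not exist in the JIT, and resultOpNum is assumed to use 0 for "no operand's register is reused" and 1..3 for op1..op3.

```cpp
#include <cassert>

// Hypothetical helper, not a function in the JIT.
// resultOpNum: 0 = no operand's register is reused as the target, 1..3 = op1..op3.
int effectiveResultOpNum(bool copiesUpperBits, int resultOpNum)
{
    assert(resultOpNum >= 0 && resultOpNum <= 3);

    // With CopyUpperBits semantics op1 must stay the destination, so a
    // preference for op2 or op3 cannot be honored and is treated as "none".
    if (copiesUpperBits && (resultOpNum != 1))
    {
        return 0;
    }
    return resultOpNum;
}
```

If resultOpNum were normalized this way once, a later check such as resultOpNum == 3 && !copiesUpperBits could shrink to resultOpNum == 3, because a value of 2 or 3 would already imply that copiesUpperBits is false.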

// op1 = (op1 * op2) + [op3] or op2 = (op1 * op2) + [op3]
// ? = (op1 * op2) + [op3] or ? = (op1 * op2) + op3
// 213 form: XMM1 = (XMM2 * XMM1) + [XMM3]
isCommutative = copiesUpperBits;
Member

Shouldn't isCommutative be !copiesUpperBits? We can't swap anything if copiesUpperBits == true.

Contributor Author

Yes, I used it inaccurately here, merely to control whether we enter the branch.

}

regNumber op1Reg = emitOp1->GetRegNum();
regNumber op2Reg = emitOp2->GetRegNum();

if (isCommutative && (op1Reg != targetReg) && (op2Reg == targetReg))
Member

Is this block still needed given the above handling?

It feels like we should already be covering this under the last block, which is op3 or nothing is contained/spilled so:

if (!copiesUpperBits && (targetReg == op2Reg))
{
    std::swap(emitOp1, emitOp2);
}

Then everything should be in the right place.

Contributor Author

Why is it if (!copiesUpperBits && (targetReg == op2Reg)) rather than if (copiesUpperBits && (targetReg == op2Reg))? I thought we needed to ensure targetReg is op1Reg only when copiesUpperBits is true.

Member

Because emitOp1 is already op1, so if copiesUpperBits == true, then we don't want to change anything.

When it's false, we only need to swap if the target reg is op2Reg.

@weilinwa
Contributor Author

@tannergooding, could you please take a look at the latest code when you have time? I resolved almost all of your comments except the resultOpNum and containedOpNum assertion. Thanks!

op1Reg = op3->GetRegNum();
op2Reg = op2->GetRegNum();
op3 = op1;
if (targetReg == op3NodeReg)
Member

I think this needs to be !copiesUpperBits && (targetReg == op3NodeReg)

Otherwise, copiesUpperBits can be true since op1 is not Contained or UsedFromSpillTemp and therefore swapping emitOp1 isn't correct.

// op1 = (op1 * op2) + [op3] or op2 = (op1 * op2) + [op3]
// ? = (op1 * op2) + [op3] or ? = (op1 * op2) + op3
// 213 form: XMM1 = (XMM2 * XMM1) + [XMM3]
if (targetReg == op2NodeReg)
Member

Likewise, I think this needs to be if (!copiesUpperBits && (targetReg == op2NodeReg)) for the same reason.

I think we also don't need the below section doing if (!copiesUpperBits && (emitOp2->GetRegNum() == targetReg)) as it will have already been covered up here.

@tannergooding
Member

Everything looks good except for the two related callouts in codegen.

Looks like there is also a merge conflict, likely due to #59912.

Member

@tannergooding tannergooding left a comment

This all LGTM. CC @kunalspathak or @echesakovMSFT: could you give a second review and merge if everything looks good to you as well?

@weilinwa
Contributor Author

@kunalspathak @echesakovMSFT, could you take a look when you have some time? Thanks.

Member

@kunalspathak kunalspathak left a comment

I think you need to uncomment the 2 asserts and run the test to make sure they are not hit.

src/coreclr/jit/lsraxarch.cpp
if (containedOpNum == 1)
{
// resultOpNum might change between lowering and lsra, comment out assertion for now.
// assert(containedOpNum != resultOpNum);
Member

Need to uncomment this assert?

Contributor Author

This assertion cannot be uncommented because the last-use value could change after the lowering step. I left the assertions here for follow-up work if necessary.

Member

Could you please create an issue for it and add the link to the issue in the comment here?

}
else if (containedOpNum == 3)
{
// assert(containedOpNum != resultOpNum);
Member

same here?

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Nov 30, 2021
Co-authored-by: Kunal Pathak <Kunal.Pathak@microsoft.com>
Member

@kunalspathak kunalspathak left a comment

Thank you @weilinwa for your patience and commitment. This looks good to me.

@kunalspathak
Member

@weilinwa - I noticed superpmi.py replay failure on linux/x64. Can you double check if it is from your change?

ISSUE: <ASSERT> D:\a\_work\1\s\src\coreclr\jit\emitxarch.cpp (6781) - Assertion failed '(op3Reg != targetReg) || (op1Reg == targetReg)' in 'System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4' during 'Generate code' (IL size 675)

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-62262-merge-d207c81ab3b14b3f92/unix-x64/1/console.122dc30b.log?sv=2019-07-07&se=2021-12-22T02%3A42%3A41Z&sr=c&sp=rl&sig=TWtmGXhWg7AuFc9lSuVCD%2FMqEkj7ZjYwRxf2ZKSSSA0%3D

@ghost ghost locked as resolved and limited conversation to collaborators Jan 3, 2022