
Assertion failed 'targetReg != op2Reg' during 'Generate code' #91209

Closed
kunalspathak opened this issue Aug 28, 2023 · 15 comments · Fixed by #92505
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows

@kunalspathak
Member

// Found by Antigen

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Numerics;
public class TestClass
{
    public struct S1
    {
        public struct S1_D1_F2
        {
        }
    }
    static Vector64<int> s_v64_int_22 = Vector64.Create(6);
    static Vector64<uint> s_v64_uint_23 = Vector64.Create((uint)5);
    static S1 s_s1_52 = new S1();
    float float_63 = 5.125f;
    Vector64<int> v64_int_72 = Vector64.Create(5);
    S1.S1_D1_F2 s1_s1_d1_f2_101 = new S1.S1_D1_F2();
    static int s_loopInvariant = 1;
    public S1.S1_D1_F2 Method3(ref Vector64<uint> p_v64_uint_155, out short p_short_156, float p_float_157, S1 p_s1_158, bool p_bool_159)
    {
        unchecked
        {
            p_short_156 = 15&4;
            int __loopvar3 = 15-4;
            while (15>4)
            {
                if (__loopvar3 > s_loopInvariant)
                    break;
                switch (Vector64.Dot(AdvSimd.MultiplyAddByScalar(s_v64_int_22, s_v64_int_22, s_v64_int_22), v64_int_72 = v64_int_72 & s_v64_int_22))
                {
                    case 2:
                    {
                        break;
                    }
                    case -2:
                    {
                        break;
                    }
                    default:
                    {
                        break;
                    }
                }
            }
            return s1_s1_d1_f2_101;
        }
    }
    public void Method0()
    {
        unchecked
        {
            short short_211 = -5;
            float float_215 = 6.0416665f;
            S1.S1_D1_F2 s1_s1_d1_f2_220 = new S1.S1_D1_F2();
            s1_s1_d1_f2_220 = Method3(ref s_v64_uint_23, out short_211, float_63 -= float_215 *= 15/4, s_s1_52, 15==4);
            return;
        }
    }
    public static void Main(string[] args)
    {
        new TestClass().Method0();
    }
}
/* 
Environment:

set COMPlus_TieredCompilation=0

Assert failure(PID 11200 [0x00002bc0], Thread: 10704 [0x29d0]): Assertion failed 'targetReg != op2Reg' in 'TestClass:Method3(byref,byref,float,TestClass+S1,bool):TestClass+S1+S1_D1_F2:this' during 'Generate code' (IL size 83; hash 0x8f2c057a; FullOpts)
    File: D:\a\_work\1\s\src\coreclr\jit\hwintrinsiccodegenarm64.cpp Line: 320
    Image: E:\kpathak\CORE_ROOT\corerun.exe
*/
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 28, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 28, 2023
@ghost

ghost commented Aug 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@kunalspathak kunalspathak added arch-arm64 os-windows and removed untriaged New issue has not been triaged by the area owner labels Aug 28, 2023
@BruceForstall BruceForstall added this to the 8.0.0 milestone Aug 30, 2023
@JulieLeeMSFT
Member

Is this duplicate of #91208?

@BruceForstall
Member

Simpler repro:

// Found by Antigen

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Numerics;
public class TestClass
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Vector64<int> GetVector64Value()
    {
        return Vector64.Create(6);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Vector64<int> Problem()
    {
        Vector64<int> l = GetVector64Value();
        return AdvSimd.MultiplyAddByScalar(l, l, l);
    }

    public static void Main(string[] args)
    {
        Vector64<int> r = Problem();
    }
}

Doesn't look like there's anything preventing LSRA from assigning the same register to the target and multiple source arguments of these intrinsics which are marked as having RMW semantics. In the test case everything is assigned d0. Without the asserts, this leads to:

mla     v0.2s, v0.2s, v0.s[0]

If target and op1 were not already the same, say target=d1, op1=op2=op3=d0, we would get:

mov v1, v0
mla     v1.2s, v0.2s, v0.s[0]

Are the asserts valid? It seems like the generated instruction reflects the intended semantics of the intrinsic -- at least in this case, since there is no cross-lane data corruption (for example).

Or should a solution be to make the target interfere with op2/op3 (e.g., make op2/op3 "delayRegFree") so LSRA gives it a different register?

@tannergooding Opinions?

@BruceForstall
Member

Actually, it does go through the "AddDelayFreeUses" code path, but it decides not to add the delayFree setting because it thinks one operand is a last use, so it is ok to reuse its register:

>	clrjit_universal_arm64_x64.dll!LinearScan::AddDelayFreeUses(RefPosition * useRefPosition, GenTree * rmwNode) Line 3456	C++
 	clrjit_universal_arm64_x64.dll!LinearScan::BuildDelayFreeUses(GenTree * node, GenTree * rmwNode, unsigned __int64 candidates, RefPosition * * useRefPositionRef) Line 3526	C++
 	clrjit_universal_arm64_x64.dll!LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic * intrinsicTree, int * pDstCount) Line 1608	C++
 	clrjit_universal_arm64_x64.dll!LinearScan::BuildNode(GenTree * tree) Line 914	C++
 	clrjit_universal_arm64_x64.dll!LinearScan::buildRefPositionsForNode(GenTree * tree, unsigned int currentLoc) Line 1774	C++

This argues for the assert needing to incorporate that case.

@BruceForstall
Member

Another oddity here:

BuildDelayFreeUses calls BuildUse to create a RefPosition for the first source operand (op2). When it creates it, it calls newRefPosition which calls associateRefPosWithInterval and that marks the RP as lastUse.

Then, BuildDelayFreeUses calls AddDelayFreeUses which uses this new lastUse setting to decide that we don't actually need to set the new use RP as delayRegFree because both RP intervals are the same and one is lastUse.

Next, BuildDelayFreeUses is called for the HWI op3. It creates a new RP, which associateRefPosWithInterval then marks as lastUse -- but it also un-marks the op2 use RP as lastUse!

This seems worrisome, that a decision (to not set delayRegFree) was made based on the lastUse setting, which later changed. Is this a problem?

@BruceForstall
Member

Eliminating the optimization in AddDelayFreeUses avoids the assert in this test case, and leads to no asm diffs (on win-arm64):

Interval* rmwInterval  = nullptr;
bool      rmwIsLastUse = false;
GenTree*  addr         = nullptr;
if ((rmwNode != nullptr) && isCandidateLocalRef(rmwNode))
{
    rmwInterval = getIntervalForLocalVarNode(rmwNode->AsLclVar());
    // Note: we don't handle multi-reg vars here. It's not clear that there are any cases
    // where we'd encounter a multi-reg var in an RMW context.
    assert(!rmwNode->AsLclVar()->IsMultiReg());
    rmwIsLastUse = rmwNode->AsLclVar()->IsLastUse(0);
}
// If node != rmwNode, then definitely node should be marked as "delayFree".
// However, if node == rmwNode, then we can mark node as "delayFree" only if
// none of the node/rmwNode are the last uses. If either of them are last use,
// we can safely reuse the rmwNode as destination.
if ((useRefPosition->getInterval() != rmwInterval) || (!rmwIsLastUse && !useRefPosition->lastUse))

@BruceForstall
Member

Testing diffs with removed optimization here: #92391

@BruceForstall
Member

On x64, removing the optimization shows a few regressions due to extra moves.

@BruceForstall
Member

It seems odd that BuildOperandUses() and BuildDelayFreeUses() are as different as they are. Also, BuildDelayFreeUses() returns the created RP via the useRefPositionRef argument, but that is not consistently set, and there are cases where multiple RPs are created, but only one is returned, which seems wrong, or at least poorly defined.

Due to the changing of the RP lastUse flag, and it being updated, as described above, perhaps setting delayRegFree should be deferred until all the use RefPositions for a node have been built? I.e., build use RPs for op1, op2, op3, then go back and mark them delayRegFree as necessary?

@BruceForstall
Member

The regressions from removing the AddDelayFreeUses optimization are on x86/x64, as described: a few extra copies:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=413805&view=ms.vss-build-web.run-extensions-tab

@BruceForstall
Member

@kunalspathak Do you have any thoughts about the LSRA questions I mention above, especially #91209 (comment) and #91209 (comment)?

@kunalspathak
Member Author

kunalspathak commented Sep 22, 2023

With Antigen, we have lately been seeing a pattern of issues where we end up passing the same variable for different parameters of intrinsic methods. With that, we end up having the same interval for the use and the rmw node, i.e. useRefPosition->getInterval() == rmwInterval, after which we rely on the lastUse bit to decide whether or not to set delayRegFree.

if ((useRefPosition->getInterval() != rmwInterval) || (!rmwIsLastUse && !useRefPosition->lastUse))

It is uncommon to have useRefPosition->getInterval() == rmwInterval. To validate that, I added assert(false) for this condition and didn't hit it for windows/x64 in the benchmarks/realworld/libraries.pmi collections.

On delayRegFree depending on a not-up-to-date lastUse bit:

I agree that we should set delayRegFree only once we have finalized lastUse, but there are two problems with that:

  • We will have to iterate through all the RefPositions once they are created, check the lastUse bit, and update delayRegFree accordingly. This can be easily solved by checking just the delayRegFree candidates.
  • When we mark delayRegFree, the interval where the result for that GenTree* node is produced (its RefPosition of type RefTypeDef) gets marked with hasInterferingUses. This bit is used in checkConflictingDefUse() while building further RefPositions: that method checks whether the register(s) for a RefTypeUse of an interval can be reused at the interval's RefTypeDef RefPosition, where the result is produced, and if so, it updates the register(s) for the RefTypeDef. However, if the interval is marked hasInterferingUses, we avoid that update, because it could conflict with the register that we want to delay-free for the GenTree* that produced the result in that RefTypeDef. To summarize, there is no straightforward way to defer the delayRegFree decision until after all the RefPositions are created, because the bit is consulted while they are being created as well. To make that work, we would also have to defer setting hasInterferingUses and then find the corresponding RefTypeUse RefPositions to readjust their register assignments, which could be complicated and TP-impacting. I could probably take a conservative approach where we continue to mark intervals with hasInterferingUses whenever useRefPosition->getInterval() != rmwInterval; that way, we would not rely on the lastUse bits to mark an interval with hasInterferingUses. In cases where the accurate lastUse bits would have shown the marking to be unnecessary, we would see an extra move (not sure if these are the same moves you saw in #91209 (comment)); the extra move could come from the register being different for the RefTypeDef and the RefTypeUse, as mentioned earlier. However, with this change we would at least maintain a correct state.

I recently ran into a similar issue in #91798 and fixed it by updating the assert. However, that was a slightly different scenario where we were marking delayRegFree properly but were using homeReg instead. There, too, we were using the same variable for 2 parameters of BlendVariable(), which surfaced the problem.

@kunalspathak
Member Author

Sent out a prototype in #92496.

@BruceForstall
Member

I realized that for this assertion (and related similar assertions), I can avoid them reasonably by only asserting if targetReg != op1Reg (where op1 is the RMW argument). I think this makes sense, if we believe that LSRA properly assigned a def register the same as a use register only if the use register is "last use". If targetReg == op1Reg, then it can also equal the other "use" operand registers if they are last use. I'll submit a PR soon.

@kunalspathak
Member Author

If targetReg == op1Reg, then it can also equal the other "use" operand registers if they are last use

That's what I ended up doing in https://github.com/dotnet/runtime/pull/92183/files#diff-6b4e0f32449f2f144e05699f59f74415a564693637f643084a896dfbd081830dR8612. However, the "last use" information that LSRA relies on becomes stale, so we want to fix that eventually.

BruceForstall added a commit to BruceForstall/runtime that referenced this issue Sep 22, 2023
There are a bunch of asserts in arm64 hardware intrinsics codegen,
for intrinsics with read-modify-write behavior, that the target
register is not the same as the non-RMW operand registers, since
we do a `mov` to the target register which could trash those argument
registers. However, if the target register and the `op1` register are
the same, then no `mov` is necessary, and also, the register allocator
should already have ensured proper lifetime conflict resolution.

So, put the asserts under a check that the `mov` is required.

Fixes dotnet#91209
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 22, 2023
BruceForstall added a commit that referenced this issue Sep 23, 2023
* Fix asserts about arm64 hardware intrinsic register selection

There are a bunch of asserts in arm64 hardware intrinsics codegen,
for intrinsics with read-modify-write behavior, that the target
register is not the same as the non-RMW operand registers, since
we do a `mov` to the target register which could trash those argument
registers. However, if the target register and the `op1` register are
the same, then no `mov` is necessary, and also, the register allocator
should already have ensured proper lifetime conflict resolution.

So, put the asserts under a check that the `mov` is required.

Fixes #91209

* Fix CLRTestTargetUnsupported usage

Also, exclude mono for Runtime_91209 test.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 23, 2023