Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arm64 skippage6.sh test fails to JIT code when page size > 4KB #42023

Closed
sdmaclea opened this issue Sep 9, 2020 · 27 comments · Fixed by #47862
Closed

arm64 skippage6.sh test fails to JIT code when page size > 4KB #42023

sdmaclea opened this issue Sep 9, 2020 · 27 comments · Fixed by #47862
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged os-macos-bigsur (macOS11)
Milestone

Comments

@sdmaclea
Copy link
Contributor

sdmaclea commented Sep 9, 2020

Cannot encode 15552 as an immediate (with #40435).

/Users/steve/git/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/JIT/Methodical/largeframes/skip6/skippage6/skippage6.sh

Assert failure(PID 49620 [0x0000c1d4], Thread: 690896 [0xa8ad0]): Assertion failed '!"Instruction cannot be encoded: IF_DI_2A"' in 'BigFrames.Test:Test1(int)' during 'Generate code' (IL size 23715)

    File: /Users/steve/git/runtime/src/coreclr/src/jit/emitarm64.cpp Line: 5598
    Image: /Users/steve/git/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/Tests/Core_Root/corerun

/Users/steve/git/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/JIT/Methodical/largeframes/skip6/skippage6/skippage6.sh: line 356: 49620 Abort trap: 6           $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"
BEGIN EXECUTION
/Users/steve/git/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/Tests/Core_Root/corerun skippage6.dll ''
Expected: 100
Actual: 134
END EXECUTION - FAILED
    frame #2: 0x0000000123d6d1b4 libclrjit.dylib`::assertAbort(why="!\"Instruction cannot be encoded: IF_DI_2A\"", file="/Users/steve/git/runtime/src/coreclr/src/jit/emitarm64.cpp", line=5598) at error.cpp:294:9
    frame #3: 0x0000000123f92340 libclrjit.dylib`emitter::emitIns_R_R_I(this=0x00000001007e5000, ins=INS_sub, attr=EA_8BYTE, reg1=REG_ZR, reg2=REG_ZR, imm=15552, opt=INS_OPTS_NONE) at emitarm64.cpp:5598:13
  * frame #4: 0x0000000123e4447c libclrjit.dylib`CodeGen::inst_RV_IV(this=0x00000001007e4a78, ins=INS_sub, reg=REG_SP, val=15552, size=EA_8BYTE, flags=INS_FLAGS_DONT_CARE) at instr.cpp:521:19
    frame #5: 0x0000000123f6ad70 libclrjit.dylib`CodeGen::genStackPointerConstantAdjustment(this=0x00000001007e4a78, spDelta=-15552) at codegenarmarch.cpp:42:5

/cc @sandreenko @JulieLeeMSFT

category:correctness
theme:prolog-epilog
skill-level:intermediate
cost:small

@sdmaclea sdmaclea added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-macos-bigsur (macOS11) labels Sep 9, 2020
@sdmaclea sdmaclea added this to the 6.0.0 milestone Sep 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 9, 2020
@sandreenko sandreenko self-assigned this Sep 9, 2020
@sandreenko sandreenko removed the untriaged New issue has not been triaged by the area owner label Sep 9, 2020
@sandreenko
Copy link
Contributor

cc @dotnet/jit-contrib

@BruceForstall
Copy link
Member

This test runs on Windows/Linux arm64 in CI; what is different about the macOS code?

@sdmaclea
Copy link
Contributor Author

sdmaclea commented Sep 10, 2020

It looks like it is just a different immediate value that is causing the problem.

Perhaps simply because I was running in debug mode?
Looks like it is more likely due to vm.pagesize: 16384 Edit: looking at the test source I don't see a reason this would be related...

@BruceForstall
Copy link
Member

Looks like it is more likely due to vm.pagesize: 16384

Actually, that seems very interesting/relevant, since the JIT stack probing code is based on page size.

@echesakov
Copy link
Contributor

sub sp, sp, #imm supports immediate in the range 0 to 4095 (and their 12 LSL versions).
As a workaround, on arm64 JIT can always probe in 4K steps (which means 4 touches per 16K page).

In the long term the stack probing should be done via helper call #13519 - JIT (or runtime) can pick the appropriate helper based on page size (here I presume helpers for 4K and 16K might be implemented slightly differently).

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 7, 2020
@tmds
Copy link
Member

tmds commented Jan 12, 2021

RHEL 8 (and derived) uses 64kB pages on arm64 and is also affected by this issue (see #43349).

cc @janvorli @omajid @leecow

@tmds
Copy link
Member

tmds commented Jan 12, 2021

Does this affect x64 with pages larger than 4kB also?

@janvorli
Copy link
Member

There is a similar problem on x64 with larger pages. We've hit it when testing .NET running on the new Apple Silicon developer test device under Rosetta 2 emulation which emulates x64 on arm64. The page size was 16kB there. However, it turned out that the final versions of Apple Silicon that are now sold as Apple M1 have fixed the page size for the emulation to be 4kB again, so we didn't have to fix it. Does RHEL 8 on x64 support non-4kB normal page sizes too?

@tmds
Copy link
Member

tmds commented Jan 13, 2021

Does RHEL 8 on x64 support non-4kB normal page sizes too?

I've used a couple of machines and they all had a page size of 64kB so I assume that is the value used when the kernel gets compiled for arm64 RHEL8.

As a workaround, on arm64 JIT can always probe in 4K steps (which means 4 touches per 16K page).

Would this workaround be feasible also for 64kB pages (-> 16 touches)?

@sandreenko are you, or someone else working on this?

@sandreenko
Copy link
Contributor

@echesakovMSFT is working on it

@sdmaclea sdmaclea changed the title osx-arm64 skippage6.sh test fails to JIT code arm64 skippage6.sh test fails to JIT code when page size > 4KB Jan 28, 2021
@JulieLeeMSFT
Copy link
Member

@echesakovMSFT please prioritize this because this is a blocking issue on RHEL8.

@echesakov
Copy link
Contributor

@JulieLeeMSFT Sure, I am actively working on #43250 that will close this issue

@echesakov echesakov added the in-pr There is an active PR which will close this issue when it is merged label Feb 8, 2021
@echesakov
Copy link
Contributor

The issue with the JIT asserting on Arm64 platforms with a page size larger than 4KB should be resolved by #47862.

@sandreenko confirmed that the Pri1 tests pass with that change on Apple M1 (see his comment here)

@tmds Can you please also verify this the tests pass with RHEL 8 on Arm64 on top of current master (bf9f899)?

@JulieLeeMSFT
Copy link
Member

Assigning @tmds to run test on RHEL8 Arm64.
@tmds we target to release this by Preview 2 (Feb-19), so it would be good if you can run tests this week.

@tmds
Copy link
Member

tmds commented Feb 10, 2021

Thanks for fixing!

I'll look at running these tests. It will take some time because the workflow is cumbersome since it involves a non-RHEL8 arm64 machine to not run into the issue while building tests, and then a RHEL8 arm64 machine to run the tests.

I hope this fix will get rid of the NullReferenceExceptions we're seeing: #43349.

For it to be usable for us, the fix needs to land in the SDK that we use in source-build to build dotnet/runtime and other .NET repos.

Do you currently plan to backport this to .NET 5?

cc @omajid @dagood

@echesakov
Copy link
Contributor

I hope this fix will get rid of the NullReferenceExceptions we're seeing: #43349.

It's unlikely. The issue you mentioned exhibits quite different symptoms.

For it to be usable for us, the fix needs to land in the SDK that we use in source-build to build dotnet/runtime and other .NET repos.

I am not familiar with the source-build infrastructure. Can you please elaborate on what version of the SDK are you talking about? Is it 5.0 or 6.0?

Do you currently plan to backport this to .NET 5?

I wasn't originally but if this blocks you in 5.0 I would propose a simpler (less risky) change (as in #45226)

@tmds
Copy link
Member

tmds commented Feb 10, 2021

It's unlikely. The issue you mentioned exhibits quite different symptoms.

We didn't find many clues from running the priority1 tests. I'm keeping my fingers crossed.

I am not familiar with the source-build infrastructure. Can you please elaborate on what version of the SDK are you talking about? Is it 5.0 or 6.0?

source-build has branches for each .NET version. .NET 5.0 branch uses a 5.0 SDK. .NET 6.0 branch will use a 6.0 SDK.

I wasn't originally but if this blocks you in 5.0 I would propose a simpler (less risky) change (as in #45226)

#43349 blocks us. When we know what is needed to fix it, we can see if it makes sense to apply changes to 5.0, or delay till 6.0.

@JulieLeeMSFT
Copy link
Member

source-build has branches for each .NET version. .NET 5.0 branch uses a 5.0 SDK. .NET 6.0 branch will use a 6.0 SDK.

I assume that you can use 6.0 SDK, and we don't need to make changes to .NET 5.0.

#43349 blocks us. When we know what is needed to fix it, we can see if it makes sense to apply changes to 5.0, or delay till 6.0.

JanV couldn't repro the issue. Let's keep on eye on it and track NullReferenceExceptions issue from #43349.

@sdmaclea
Copy link
Contributor Author

sdmaclea commented Feb 10, 2021

We are seeing a lot of NullReferenceExceptions in gcstress of Apple Silicon. We also see them as occasional intermittent failures on Apple Silicon. I will need to look into these to get Apple Silicon stable. These likely will affect RHEL8 arm64 too.

@tmds
Copy link
Member

tmds commented Feb 10, 2021

I assume that you can use 6.0 SDK, and we don't need to make changes to .NET 5.0.

To build for 5.0, we need a 5.0 SDK.

We are seeing a lot of NullReferenceExceptions in gcstress of Apple Silicon. We also see them as occasional intermittent failures on Apple Silicon. I will need to look into these to get Apple Silicon stable. These likely will affect RHEL8 arm64 too.

Are these tests I can run? Is there an issue tracking this?

We don't see NullReferenceExceptions on Fedora arm64. Fedora arm64 uses a 4KB page size, which is why I thought the NullReferenceExceptions may be related to the page size.

@JulieLeeMSFT
Copy link
Member

To build for 5.0, we need a 5.0 SDK.

@echesakovMSFT can make a simpler (less risky) change as he mentioned for .NET 5 (as in #45226).
@echesakovMSFT please go ahead for the change.

@tmds
Copy link
Member

tmds commented Feb 10, 2021

@JulieLeeMSFT @echesakovMSFT I'm fine waiting to make changes to 5.0 until we figured out all fixes needed to solve the NullReferenceExceptions issue.

@JulieLeeMSFT
Copy link
Member

We are seeing a lot of NullReferenceExceptions in gcstress of Apple Silicon.

@sdmaclea I guess you can use this .NET 6 fix to test Apple Silicon NullReferenceExceptions issue, correct?

@sdmaclea
Copy link
Contributor Author

@sdmaclea I guess you can use this .NET 6 fix to test Apple Silicon NullReferenceExceptions issue, correct?

@JulieLeeMSFT As @echesakovMSFT said it is unlikely that this fixed #43349 or the equivalent issues we are seeing on Apple Silicon. I was just commenting that this is a common issue which we are likely going to invest in fixing for 6.0.

I will certainly be testing this as we bring up CI for Apple Silicon.

@echesakovMSFT can make a simpler (less risky) change as he mentioned for .NET 5 (as in #45226).

I am not sure why #45226 wouldn't have fixed the issue already. I thought it was the only place the JIT asked about the stack size.

@tmds
Copy link
Member

tmds commented Feb 15, 2021

we target to release this by Preview 2 (Feb-19), so it would be good if you can run tests this week.

@JulieLeeMSFT I ran the tests, and the "Instruction cannot be encoded: IF_DI_2A" is no longer present in the log. This issue can be closed.

I am not sure why #45226 wouldn't have fixed the issue already. I thought it was the only place the JIT asked about the stack size.

@sdmaclea the PR you reference was not merged.

Is there an issue for the NullReferenceExceptions you've observed on the Apple Silicon?

I'm fine waiting to make changes to 5.0 until we figured out all fixes needed to solve the NullReferenceExceptions issue.

@JulieLeeMSFT @echesakovMSFT We've discussed this internally, and we've changed our milestone for .NET on arm64 to .NET 6.

@sdmaclea
Copy link
Contributor Author

Is there an issue for the NullReferenceExceptions you've observed on the Apple Silicon?

It is not called out specifically, but it was some of the symptoms I was seeing with intermittent failures #46365 and gcstress failures #46330 & #46471.

@JulieLeeMSFT
Copy link
Member

@JulieLeeMSFT I ran the tests, and the "Instruction cannot be encoded: IF_DI_2A" is no longer present in the log. This issue can be closed.

@tmds thanks for the note. Will close this now.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged os-macos-bigsur (macOS11)
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

8 participants