Assert failure: !pFlat->HasReadyToRunHeader() #68654

Closed
SingleAccretion opened this issue Apr 28, 2022 · 23 comments
Labels: area-ReadyToRun-coreclr, blocking-clean-ci (Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms')

@SingleAccretion
Contributor

Seen in runtime-dev-innerloop:

1) https://dev.azure.com/dnceng/public/_build/results?buildId=1741950&view=logs&j=9db4066d-6bf0-549a-7716-e181239d2ea7&t=ee7de8a6-b1ed-5700-3f81-b0654bf893a4
2) https://dev.azure.com/dnceng/public/_build/results?buildId=1742050&view=logs&j=9db4066d-6bf0-549a-7716-e181239d2ea7&t=ee7de8a6-b1ed-5700-3f81-b0654bf893a4

  crossgen2 -> /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/
  
  Assert failure(PID 30724 [0x00007804], Thread: 157356 [0x266ac]): !pFlat->HasReadyToRunHeader()
      File: /Users/runner/work/1/s/src/coreclr/vm/peimagelayout.cpp Line: 87
      Image: /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2
  
  /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp837be50a24fb42d4918c050befffacf4.exec.cmd: line 2: 30724 Abort trap: 6           /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp
/Users/runner/work/1/s/src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj(70,5): error MSB3073: The command "/Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp" exited with code 134.
@danmoseley
Member

I got this on another PR last night.

@danmoseley danmoseley added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Apr 28, 2022
@danmoseley
Member

danmoseley commented Apr 28, 2022

We need to set firm criteria, but generally we're labeling failures in PR validation/CI with blocking-clean-ci if they have happened in the last 2 weeks or so and are not one-off cases. Feel free to use this label.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label May 2, 2022
@mangod9 mangod9 added this to the 7.0.0 milestone May 2, 2022
@mangod9
Member

mangod9 commented May 2, 2022

Adding @trylek to take a look, since this is affecting PRs.

@trylek
Member

trylek commented May 2, 2022

Hmm, this is weird. This happens when the CoreCLR native runtime runs the crossgen2 app to compile System.Private.CoreLib; but at that point Crossgen2 itself shouldn't be crossgenned, that only happens "later" in the installer build. At first glance, I suspect this is related to @agocke's #67636 (the timing more or less matches, as the PR was merged 6 days ago). I also believe that this is the code area that @VSadov substantially refactored some time ago as part of his work on single exe publishing; maybe the invariants have shifted somehow w.r.t. the Native AOT CG2 build?

@agocke
Member

agocke commented May 2, 2022

Looks like this is only happening on Mac. NativeAOT isn't supported on Mac, so this should be R2R + single-file.

@agocke
Member

agocke commented May 2, 2022

This might be a real product issue. I think this is the only testing configuration where we don't run single file in Release, where I presume the assert doesn't exist. + @VSadov

@VSadov
Member

VSadov commented May 2, 2022

I think this is the same as #67062

I am investigating the issue and it looks like we sometimes see PE sections overlapping in memory. This is either a loader bug or a crossgen bug, most likely crossgen.
Either way, we should be able to lay out a PE that we ourselves produce.

We used to silently take a fallback approach; now it causes a failure (and an assert in Debug).
The assert detects that we are trying to load an R2R assembly as Flat. That means mapping it failed because we could not handle the PE format, which is unexpected.

@VSadov
Member

VSadov commented May 2, 2022

The algorithms are the same for OSX and Unix in general, but we see these failures on OSX only. The main difference is that OSX has a larger OS page and we align to that when we map.

My current theory is that we may not leave enough room between section RVAs for a potentially bigger alignment.
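
To make the theory above concrete, here is a minimal Python sketch, not the runtime's actual code. It rounds each section's destination range out to page boundaries and reports the first section whose rounded range starts inside the previous one. The function name `first_overlap`, the section tuples, and the 4 KB and 16 KB page sizes are all illustrative assumptions, not values taken from the failing build.

```python
def align_down(value: int, alignment: int) -> int:
    """Round value down to a multiple of alignment (a power of two)."""
    return value & ~(alignment - 1)


def align_up(value: int, alignment: int) -> int:
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def first_overlap(sections, page_size):
    """Return the name of the first section whose page-rounded range
    overlaps the previous section's page-rounded range, or None.

    sections: list of (name, rva, size) tuples sorted by rva (hypothetical shape).
    """
    previous_end = 0
    for name, rva, size in sections:
        start = align_down(rva, page_size)
        end = align_up(rva + size, page_size)
        if start < previous_end:
            return name
        previous_end = end
    return None


# Hypothetical layout: RVAs are spaced for 4 KB alignment only.
SECTIONS = [(".text", 0x1000, 0x3800), (".data", 0x5000, 0x600)]

small_page_result = first_overlap(SECTIONS, 0x1000)   # fits with 4 KB pages
large_page_result = first_overlap(SECTIONS, 0x4000)   # collides with 16 KB pages
```

With the smaller page size the rounded ranges are adjacent but disjoint; with the larger one, `.data` rounds down into the pages already claimed by `.text`, which is the kind of collision the theory describes.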

@VSadov
Member

VSadov commented May 2, 2022

This happens when the CoreCLR native runtime runs the crossgen2 app to compile System.Private.CoreLib

The failure indicates that crossgen (or some of its assemblies) is R2R already.
How can it happen before CoreLib is crossgenned? Are we using assemblies from the toolset?

If the bug is in crossgen, I wonder how soon this failure will go away after it is fixed in main.

@agocke
Member

agocke commented May 2, 2022

For the log,

crossgen2 -> /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/

Assert failure(PID 30724 [0x00007804], Thread: 157356 [0x266ac]): !pFlat->HasReadyToRunHeader()
File: /Users/runner/work/1/s/src/coreclr/vm/peimagelayout.cpp Line: 87
Image: /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2

I think it's possible that System.Private.CoreLib may already be crossgen'd here, but the rest of the framework assemblies might not be. IIRC we run crossgen against S.P.C in the normal build, but not against the libraries.

@trylek
Member

trylek commented May 2, 2022

I think there's a bit of subtlety worth elaborating on. Please note that the build step that fails is execution of Crossgen2 with the aim to compile System.Private.CoreLib; if we were trying to recompile an already crossgenned SPC, that would probably amount to a build script bug causing us to run the step twice or something. But I believe that's not what's happening for two reasons:

  1. When Crossgen2 only manipulates System.Private.CoreLib as the compilation input / output, it doesn't use the native peimagelayout to load the DLL; it manipulates it using the managed library System.Reflection.PortableExecutable.

  2. Last I heard, Crossgen2 in testing isn't actually executed using the live .NET Core bits. That was rough because in the past Crossgen2 compilation bugs ended up emitting a flaky framework, and Crossgen2 running on top of it subsequently failed in weird ways when building tests. Before he left the team, Simon changed the scripts so that Crossgen2 execution in the framework and CoreCLR test builds should be using the "repo dotnet" SDK.

@VSadov, what is the difference between Linux and OSX page sizes for the individual architectures? The only conditional I originally put in was that 32bit = 4K, 64bit = 64K. If it's more involved, this may certainly need updating. Having said that, please note that the check for the presence of the R2R header checks, among other things, the R2R version number - I have a hard time to imagine how this could just randomly appear in the in-memory mapped executable space due to some section overlaps.

@agocke
Member

agocke commented May 2, 2022

Ah, yes, if this is the test execution of crossgen2 after it has been published, that's this logic:

  <Target Name="RunPublishedCrossgen" AfterTargets="PublishCrossgen"
          Condition="'$(TargetOS)' == '$(HostOS)' and '$(TargetArchitecture)' == '$(BuildArchitecture)'">
    <!-- Run the published crossgen if we're not cross-compiling -->
    <Exec Command="@(FilesToPackage) $(CoreCLRArtifactsPath)IL/System.Private.CoreLib.dll --out $(IntermediateOutputPath)S.P.C.tmp" Condition="'%(FileName)%(Extension)' == 'crossgen2$(ExeSuffix)'"
          ConsoleToMsBuild="true">
      <Output TaskParameter="ConsoleOutput" PropertyName="CrossgenOutput" />
      <Output TaskParameter="ExitCode" PropertyName="CrossgenExitCode" />
    </Exec>
    <Error Text="Crossgen failed with code $(CrossgenExitCode), output: $(CrossgenOutput)" Condition="$(CrossgenExitCode) != 0" />
  </Target>

Which notably runs against $(CoreCLRArtifactsPath)IL/System.Private.CoreLib.dll, which should not be R2Red.

@trylek
Member

trylek commented May 2, 2022

Yes, I believe that's exactly it according to the log:

  Assert failure(PID 31107 [0x00007983], Thread: 154553 [0x25bb9]): !pFlat->HasReadyToRunHeader()
      File: /Users/runner/work/1/s/src/coreclr/vm/peimagelayout.cpp Line: 87
      Image: /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2
  
  /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/tmp5d8ef3b259b44fadafe924759643b8b1.exec.cmd: line 2: 31107 Abort trap: 6           /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp
/Users/runner/work/1/s/src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj(70,5): error MSB3073: The command "/Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp" exited with code 134.
##[error]src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Crossgen2.sfxproj(70,5): error MSB3073: (NETCORE_ENGINEERING_TELEMETRY=Build) The command "/Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Release/crossgen2/osx-x64/publish/crossgen2 /Users/runner/work/1/s/artifacts/bin/coreclr/OSX.x64.Debug/IL/System.Private.CoreLib.dll --out /Users/runner/work/1/s/artifacts/obj/Microsoft.NETCore.App.Crossgen2/Release/net7.0/osx-x64/S.P.C.tmp" exited with code 134.

@trylek
Member

trylek commented May 2, 2022

There's one other thing that I find super weird. According to

<CrossGenDllCmd>$(DotNetCli) $([MSBuild]::NormalizePath('$(BinDir)', '$(CrossDir)', 'crossgen2', 'crossgen2.dll'))</CrossGenDllCmd>

we should be running crossgen2 via dotnet but in the above log it looks like we're directly executing the native app "crossgen2". I don't see any such code path capable of emitting the above command-line in my repo clone that's about 2 days old; where can it be coming from?

@trylek
Member

trylek commented May 2, 2022

Hmm, my bad, we're apparently running it through the sfxproj, I guess I'm now completely confused where / how the SPC crossgenning takes place. Is the crossgen2-corelib script still used?

@agocke
Member

agocke commented May 2, 2022

Yup, there are two crossgen calls now. The first is at the very beginning of the CoreCLR build, which we need for all later runs with S.P.C.

The second is during package construction, where we compile crossgen2 as either NativeAOT or single-file. That's a live build: it gathers all the live binaries and compiles them at that point. It then compiles S.P.C one last time, as a basic validation that the build works. That last build is the one that's failing.

@trylek
Member

trylek commented May 2, 2022

Ahh, thanks Andy, I see it now in the log. So that's actually an additional interesting data point: the "first" CG2 execution for SPC compilation under "dotnet" passes just fine; it's only the second one, presumably using the single-file CG2 version, that crashes when compiling SPC. In that case I'm going to reassign the bug to Vlad for now for further investigation. If it turns out that the problem is indeed in Crossgen2 producing an incorrect PE layout on OSX, I'll be happy to get involved again. Sadly, I don't currently have a local OSX machine available for testing, so I guess I'd need to work with someone who does.

@trylek trylek assigned VSadov and unassigned trylek May 2, 2022
@VSadov
Member

VSadov commented May 2, 2022

I have a hard time to imagine how this could just randomly appear in the in-memory mapped executable space due to some section overlaps.

No, the scenario is like:

  • we are mapping the PE into memory
  • we are checking that sections that we map will not be overlapping the previous one
  • we see one that overlaps, so we fail mapping
  • loader sees that mapping failed and tries to load as Flat.
    That would be ok for a non-R2R, IL-only assembly. We would prefer Loaded, but Flat is also ok. We could have failed to load due to a platform mismatch, but a non-R2R assembly would not have any native bits and would be jitted anyway.
  • we notice that this is actually an R2R assembly and it should have been loaded. There is something in the PE format that we could not handle. At this point we Fail/Assert.
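
The steps above can be sketched as a small decision function. This is a hedged Python illustration of the described flow only; the names `Layout`, `LayoutError`, and `choose_layout` are hypothetical and do not correspond to identifiers in peimagelayout.cpp.

```python
from enum import Enum


class Layout(Enum):
    LOADED = "loaded"  # sections mapped at their RVAs
    FLAT = "flat"      # file contents used as-is, IL only


class LayoutError(Exception):
    """Raised when an R2R image could not be mapped (hypothetical name)."""


def choose_layout(mapping_succeeded: bool, has_r2r_header: bool) -> Layout:
    """Mirror the scenario: try to map, fall back to Flat only for IL-only images."""
    if mapping_succeeded:
        return Layout.LOADED
    # Mapping failed, for example because two sections would overlap.
    if has_r2r_header:
        # An R2R image should have been mappable; falling back would
        # silently disable its native code, so treat this as a failure.
        raise LayoutError("R2R image could not be mapped")
    # IL-only image: Flat is acceptable because the code is jitted anyway.
    return Layout.FLAT
```

The point of the sketch is the asymmetry: a failed mapping is tolerated for an IL-only assembly but reported for an R2R one, which is the condition the `!pFlat->HasReadyToRunHeader()` assert expresses.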

@trylek
Member

trylek commented May 2, 2022

I see, thanks Vlad for clarifying. Please let me know if there's anything I can help with on the Crossgen2 side. For now, the fact that this only fails when Crossgen2 is executed in single-exe mode makes me believe it's more likely related to the logic of loading assemblies from the bundled exe than to a general Crossgen2 error. After all, the Crossgen2-compiled framework assemblies and tests have been running in the lab on OSX-x64 for more than two years now without any crash like this.

@agocke
Member

agocke commented May 2, 2022

I wonder if this could be Debug/Release S.P.C/runtime mismatch, but I thought I guarded against that possibility with

  <!-- Copy System.Private.CoreLib from the coreclr bin directory to the runtime pack directory,
       as we always need the copy of System.Private.CoreLib that matches exactly with the runtime. -->
  <Copy SourceFiles="$(CoreCLRArtifactsPath)System.Private.CoreLib.dll"
        DestinationFolder="$(MicrosoftNetCoreAppRuntimePackNativeDir)"
        SkipUnchangedFiles="true" />

@VSadov
Member

VSadov commented May 3, 2022

@agocke no, a S.P.C/runtime mismatch would not assert at layout time. It would have random behavior/crash later.

@trylek yes, singlefile is a factor.
On Unix we can align sections of a PE file at a smaller granularity than an OS page, but when it comes to mapping we round to the page and map sections from the file with some surrounding data. That is ok as long as the destination regions for different sections, in their page-aligned form, do not overlap.
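
As a rough illustration of "round to the page and map with some surrounding data", the Python sketch below computes the page-aligned file offset, destination RVA, and length that a file-backed mapping would need for one section. It assumes the usual mmap constraint that the file offset and the destination address are page-aligned, so the section's file offset and RVA must be congruent modulo the page size. The function name `mapping_region` and the sample numbers are hypothetical.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def mapping_region(file_offset: int, rva: int, size: int, page_size: int):
    """Return (map_offset, map_rva, map_length) for a page-granular mapping.

    The mapping starts at the page boundary below the section, so it also
    covers the bytes that precede the section within that page.
    """
    if (file_offset - rva) % page_size != 0:
        raise ValueError("file offset and RVA differ within a page; cannot map")
    slack = rva % page_size              # bytes before the section in its first page
    map_offset = file_offset - slack     # page-aligned file offset
    map_rva = rva - slack                # page-aligned destination
    map_length = align_up(slack + size, page_size)
    return map_offset, map_rva, map_length


# Hypothetical section: 0x500 bytes at file offset 0x1200, RVA 0x3200, 4 KB pages.
region = mapping_region(0x1200, 0x3200, 0x500, 0x1000)
```

Here the mapping covers RVAs 0x3000 through 0x3FFF even though the section itself only occupies 0x3200 through 0x36FF. That extra coverage is harmless until a neighbouring section needs some of the same pages.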

We had some issues with this on OSX in the past. We were just handling them silently with a fallback strategy of copying instead of mapping. The assembly would still load, but startup would be impacted, R2R disabled, etc. It is intentional that a failure to map causes asserts/failures now.
The issues that we had were fixed, but evidently not enough for some singlefile cases.

Either way, what crossgen produces must match what the loader expects. One of the two will need to take a fix.

@jakobbotsch
Member

Fixed by #68845 (not sure why this didn't auto close)

@ghost ghost locked as resolved and limited conversation to collaborators Jun 5, 2022