Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wasm][aot] Link step failing for Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests - wasm-opt getting killed #51961

Closed
radical opened this issue Apr 27, 2021 · 30 comments · Fixed by #60701 or #63958
Assignees
Labels
arch-wasm WebAssembly architecture area-Codegen-AOT-mono wasm-aot-test WebAssembly AOT Test
Milestone

Comments

@radical
Copy link
Member

radical commented Apr 27, 2021

System.Text.Json.Tests

Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests, log

emcc : error : '/datadisks/disk1/work/B7A309CE/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -O2 --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B7A309CE/w/B09909A9/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B7A309CE/w/B09909A9/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals'
   failed (received SIGKILL (-9)) [/datadisks/disk1/work/B7A309CE/w/B09909A9/e/publish/ProxyProjectForAOTOnHelix.proj]
[ 2582.708542] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=wasm-opt,pid=19587,uid=1000
[ 2582.708558] Out of memory: Killed process 19587 (wasm-opt) total-vm:7377176kB, anon-rss:7254096kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14340kB oom_score_adj:0
[ 2582.925546] oom_reaper: reaped process 19587 (wasm-opt), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The binlog should have details of why this failed. The binlog is included in the helix data.

@radical radical added arch-wasm WebAssembly architecture disabled-test The test is disabled in source code against the issue area-Codegen-AOT-mono wasm-aot-test WebAssembly AOT Test labels Apr 27, 2021
@ghost
Copy link

ghost commented Apr 27, 2021

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

System.Text.Json.Tests

emcc : error : '/home/helixbot/work/AA8B0906/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/dotnet.wasm -o /home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (-9) [/home/helixbot/work/AA8B0906/w/A4FB08F1/e/publish/AOTTestProjectForHelix.proj]
/home/helixbot/work/AA8B0906/p/build/wasm/WasmApp.targets(443,5): error MSB3073: The command "emcc "@/home/helixbot/work/AA8B0906/w/A4FB08F1/e/wasm_build/obj/wasm/emcc-link.rsp"" exited with code 1. [/home/helixbot/work/AA8B0906/w/A4FB08F1/e/publish/AOTTestProjectForHelix.proj]

The binlog should have details of why this failed. The binlog is included in the helix data.

Author: radical
Assignees: -
Labels:

arch-wasm, area-Codegen-AOT-mono, disabled-test, wasm-aot-test

Milestone: -

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 27, 2021
@lewing lewing removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2021
@lewing lewing added this to the 6.0.0 milestone Apr 28, 2021
@vargaz
Copy link
Contributor

vargaz commented May 6, 2021

This probably happens because the code is just too big.

@radical
Copy link
Member Author

radical commented Jun 10, 2021

The current failure:

         cache:INFO:  - ok
          "/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp
     1>emcc : error : '/datadisks/disk1/work/A6A60972/p/build/emsdk/upstream/bin/wasm-ld @/datadisks/disk1/work/A6A60972/t/emscripten_fqv4w7nk.rsp' failed (-9) [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
     1>/datadisks/disk1/work/A6A60972/p/build/wasm/WasmApp.Native.targets(278,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A6A60972/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A6A60972/w/A34508DD/e/wasm_build/obj/wasm/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj]
     1>Done Building Project "/datadisks/disk1/work/A6A60972/w/A34508DD/e/publish/AOTTestProjectForHelix.proj" (default targets) -- FAILED.

@radical radical changed the title [wasm][aot] wasm-opt fails for System.Text.Json.Tests [wasm][aot] wasm-ld fails for System.Text.Json.Tests Jun 15, 2021
@lewing
Copy link
Member

lewing commented Jun 28, 2021

This should be fixed by #53603 and the flto changes

@lewing lewing closed this as completed Jun 28, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 28, 2021
@ericstj
Copy link
Member

ericstj commented Sep 19, 2021

I think this is still a problem. Hit this in release/6.0-rc2:


  Linking with emcc. This may take a while ...
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-ld" @/datadisks/disk1/work/B1890944/t/emscripten_hxn4qmrx.rsp
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-emscripten-finalize" --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features
   "/datadisks/disk1/work/B1890944/p/build/emsdk/node/14.15.5_64bit/bin/node" /datadisks/disk1/work/B1890944/p/build/emsdk/upstream/emscripten/src/compiler.js /datadisks/disk1/work/B1890944/t/tmpw7z52_k9.txt
   "/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt" --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals
emcc : error : '/datadisks/disk1/work/B1890944/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1890944/p/build/wasm/WasmApp.Native.targets(369,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1890944/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1890944/w/A44809CF/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1890944/w/A44809CF/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1890944/w/A44809CF/uploads/xharness-output

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-59295-merge-2cb4a6848b884da6ae/System.Text.Json.Tests/1/console.380b5c81.log?sv=2019-07-07&se=2021-10-08T18%3A00%3A28Z&sr=c&sp=rl&sig=jvWokQtPb4dRituhcL%2FAy82QK5Rz6S7goic8ukzXI5E%3D

CC @layomia

@ericstj ericstj reopened this Sep 19, 2021
@radical radical removed disabled-test The test is disabled in source code against the issue blocking-aot-tracking-pr labels Sep 20, 2021
@radical radical self-assigned this Sep 20, 2021
@steveharter
Copy link
Member

steveharter commented Sep 20, 2021

Also likely causing issue for new logging generator tests. See #59092 (comment)

@radical
Copy link
Member Author

radical commented Sep 20, 2021

Also likely causing issue for new logging generator tests. See #59092 (comment)

That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc with emcc.

@radical radical changed the title [wasm][aot] wasm-ld fails for System.Text.Json.Tests [wasm][aot] Link step failing for System.Text.Json.Tests - wasm-opt getting killed Sep 20, 2021
@steveharter
Copy link
Member

That is failing when compiling Microsoft.CodeAnalysis.CSharp.dll.bc with emcc.

Yes to help determine what is causing the failure, there were a few temporary commits where the the new functional code and tests were no longer called, but the CI issue remained.

@radekdoulik
Copy link
Member

It might be the linker running out of memory again? Would disabling optimizations in linker -Wl,-lto-O0 help here?

@radical
Copy link
Member Author

radical commented Sep 23, 2021

This will get fixed by #59397 , same as it did on main.

@lewing
Copy link
Member

lewing commented Sep 23, 2021

This will get fixed by #59397 , same as it did on main.

#59092 is not fixed by that

@danmoseley
Copy link
Member

I see this

emcc : error : '/datadisks/disk1/work/B1D309C1/p/build/emsdk/upstream/bin/wasm-opt --post-emscripten --no-exit-runtime -Oz --low-memory-unused --zero-filled-memory --strip-producers /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/dotnet.wasm -g --mvp-features --enable-mutable-globals' failed (received SIGKILL (-9)) [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/B1D309C1/p/build/wasm/WasmApp.Native.targets(374,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/B1D309C1/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/B1D309C1/w/99A208F6/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/B1D309C1/w/99A208F6/e/publish/ProxyProjectForAOTOnHelix.proj]
XHarness artifacts: /datadisks/disk1/work/B1D309C1/w/99A208F6/uploads/xharness-output

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-60112-merge-1c763c950edf44f2b0/System.Text.Json.Tests/1/console.0ff5206b.log?sv=2019-07-07&se=2021-10-27T13%3A48%3A50Z&sr=c&sp=rl&sig=Vlj0U9Lm4w1sYqrLDYGVsnvTmUldx6%2B0sr2PWre4tyI%3D
Console log: 'System.Text.Json.Tests' from job 1c763c95-0edf-44f2-b01f-ab2dd0c22a82 workitem 64f04465-1537-4154-b8a9-f39b9be56ef2 (ubuntu.1804.amd64.open.svc) executed on machine a001G8K

this ran on 10/7/2021, 8:32:46 AM. from the comments above I'm not sure whether this was expected to be fixed (this is 6.0)

@radical
Copy link
Member Author

radical commented Oct 8, 2021

wasm-opt getting killed due to OOM: failed (received SIGKILL (-9)). We are seeing more of oom kills recently with STJ, and test projects that use source generators.

In some cases they fail when compiling an assembly's bitcode to wasm .o, and in some it fails when linking (including wasm-opt).

I plan to:

  1. try to detect this case (very roughly), and dump some logs to help with debugging
  2. maybe skip some of the big assemblies that fail when compiling
  3. Explore using different optimization flags to reduce the memory footprint (we did try this a bit, and it didn't help)

radical added a commit to radical/runtime that referenced this issue Oct 27, 2021
- `Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests`
- `Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests`
- `System.Text.Json.SourceGeneration.Roslyn3.11.Unit.Tests`
- `System.Text.Json.SourceGeneration.Roslyn4.0.Unit.Tests`
- `System.Text.RegularExpressions.Generators.Tests`

Some individual ones are disabled due to
dotnet#58226, and
dotnet#60899 .

Fixes dotnet#51961 .
radical added a commit that referenced this issue Nov 2, 2021
This adds support for setting per-project optimization flags, and sets them for these projects so they don't OOM.

Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests
Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests
System.Text.Json.SourceGeneration.Roslyn3.11.Unit.Tests
System.Text.Json.SourceGeneration.Roslyn4.0.Unit.Tests
System.Text.RegularExpressions.Generators.Tests
Some individual ones are disabled due to
#58226, and #60899 .

Additionally, if a AOT build fails, then on linux it dumps the last few lines from dmesg, to help identify it was an oom-kill.

Fixes #51961 .

Co-authored-by: Larry Ewing <lewing@microsoft.com>
@krwq
Copy link
Member

krwq commented Nov 9, 2021

reopenning since we've re-disabled tests

@krwq krwq reopened this Nov 9, 2021
@radical radical changed the title [wasm][aot] Link step failing for System.Text.Json.Tests - wasm-opt getting killed [wasm][aot] Link step failing for Microsoft.Extensions.Logging.Generators.Roslyn3.11.Tests - wasm-opt getting killed Nov 9, 2021
@radical radical modified the milestones: 6.0.0, 7.0.0 Nov 9, 2021
@radical radical removed their assignment Nov 9, 2021
@lewing
Copy link
Member

lewing commented Nov 14, 2021

@ilyas1974 we are running into this over and over, what can we do to get more memory on the VM.

@directhex
Copy link
Contributor

Struggling to repro this locally. 😰

@directhex
Copy link
Contributor

Oh no, I can reproduce it locally, and learned something very cursed about Docker

@directhex
Copy link
Contributor

directhex commented Jan 20, 2022

Okay. Findings:

  • The job in question runs on Helix on a bare D2a_v4 Azure VM (2 cores of AMD Epyc, 8GB RAM)
  • Serializing the build, and using a less-memory-intensive compiler optimization level are about the best we can do to reduce the memory usage, and we already do them
  • The Clang compile stage requires 7.8 gig of RAM. If it's failing on the clang step in CI, then other basic OS stuff on the VM is likely using enough RAM that Clang can't get the 7.8 gig it wants
  • The final step of the build, the call to wasm-opt, requires more memory than Clang - 8.5 gig.
  • Swap exists 😬

@vargaz
Copy link
Contributor

vargaz commented Jan 20, 2022

The problem is AOTing roslyn, a linked Microsoft.CodeAnalysis.CSharp.dll still contains 30k methods, AOT creates 80k methods out of this because roslyn contains a lot of valuetypes causing a lot of generic instances to be added, running the aot compiler on this assembly alone takes 40 seconds, the .bc -> .o step takes 10 minutes (!).

@vargaz
Copy link
Contributor

vargaz commented Jan 20, 2022

We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.

@radical
Copy link
Member Author

radical commented Jan 20, 2022

We might want to add some functionality to the build system to exclude certain assemblies like roslyn from AOTing.

We do have that, as a hack (https://github.com/dotnet/runtime/blob/main/src/tasks/AotCompilerTask/MonoAOTCompiler.cs#L488-L489).

@directhex
Copy link
Contributor

So what's the play, try to shoehorn the test as-is onto Helix, or try to wire up the AOT skip?

@vargaz
Copy link
Contributor

vargaz commented Jan 20, 2022

I'd suggest skipping AOTing for these assemblies, they take forever to AOT which also impacts CI performance even if it works.

@vargaz vargaz self-assigned this Jan 21, 2022
@directhex
Copy link
Contributor

Failure is at

if (amodule->out_of_date) {
mono_trace (G_LOG_LEVEL_DEBUG, MONO_TRACE_AOT, "AOT: Module %s is unusable because a dependency is out-of-date.", assembly->image->name);
if (mono_aot_only)
g_error ("Failed to load AOT module '%s' while running in aot-only mode because a dependency cannot be found or it is out of date.\n", found_aot_name);
which makes sense

@vargaz
Copy link
Contributor

vargaz commented Jan 24, 2022

Is there a PR which skips the AOT step for these assemblies ?

@vargaz
Copy link
Contributor

vargaz commented Jan 24, 2022

An alternative would be compiling with O1, i.e.:

    <EmccLinkOptimizationFlag>-O1</EmccLinkOptimizationFlag>

Running with this reduced the total runtime from 16 minutes to 11 minutes on my machine, it might solve the OOM problem as well.

@directhex
Copy link
Contributor

directhex commented Jan 24, 2022 via email

@vargaz
Copy link
Contributor

vargaz commented Jan 25, 2022

#64294

@elinor-fung
Copy link
Member

@directhex @radical - the latest rolling build (which has #63958) still hit this.

Still looks to be an OOM, but I don't know how to get more information about the exact cause.

emcc : error : '/datadisks/disk1/work/A50F091E/p/build/emsdk/upstream/bin/wasm-emscripten-finalize --minimize-wasm-changes -g --dyncalls-i64 --dwarf /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm -o /datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/dotnet.wasm --detect-features' failed (received SIGKILL (-9)) [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
/datadisks/disk1/work/A50F091E/p/build/wasm/WasmApp.Native.targets(404,5): error MSB3073: The command "emcc "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-default.rsp" "@/datadisks/disk1/work/A50F091E/p/build/microsoft.netcore.app.runtime.browser-wasm/runtimes/browser-wasm/native/src/emcc-link.rsp" "@/datadisks/disk1/work/A50F091E/w/B308094D/e/wasm_build/obj/wasm/for-build/emcc-link.rsp"" exited with code 1. [/datadisks/disk1/work/A50F091E/w/B308094D/e/publish/ProxyProjectForAOTOnHelix.proj]
[ 5962.061349] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/helix.service,task=wasm-emscripten,pid=23875,uid=1000
[ 5962.061364] Out of memory: Killed process 23875 (wasm-emscripten) total-vm:5798616kB, anon-rss:5590440kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:11100kB oom_score_adj:0
[ 5962.258803] oom_reaper: reaped process 23875 (wasm-emscripten), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.