JIT.jit64.mcc fails on newly enabled Windows arm32 queue Windows.10.Arm64v8.Open #32320
Comments
P.S. I tried to check whether the failures are confined to a single machine, but I found at least two: DDARM64-056 and DDARM64-110.
[Table of affected builds, configurations, and Helix logs]
This is affecting a large fraction of the PRs. I am trying to disable these tests in #32372 until this is fixed.
Disabling the mcc tests moved the failure to the next work item, which means the failure is not specific to the mcc tests; they just happen to be a victim due to ordering. @trylek Can we disable the ARM runs until this is fixed?
Submitted a PR to disable coreclr's test execution on ARM: #32404
Tomas already disabled them: d9bc547
FWIW, one thing occurred to me during my chat with Viktor yesterday: when I was standing up the queue of Galaxy Book laptops for .NET Native testing about 1 1/2 years ago, I was hitting weird reliability issues that I later found out to be caused by the fact that the Windows installation on these laptops was continually spewing internal crash dumps onto the relatively small HDD, which was soon overflowing. I ended up talking to some Watson folks who recommended setting a magic environment variable, which ultimately fixed that. I'm not saying this is necessarily the cause here, but I can easily imagine that some of the weird symptoms, like the lack of correlation with a particular workload or the non-deterministic absence of logs, could be explained by a lack of disk space.
Adding a link to the related older item for reference: #1097
Sorry for joining the party late, I am taking a look now.
I spent some time pondering JIT.jit64.mcc and the logs. It's clear that the problem is that we never really expected 3+ GB work item payloads, but we can make it work. Note that if it's slow to unzip on my computer, it's slow to unzip on the Helix laptops. Sample log.
- Work item payload zip: ~941 MB zipped (963,810 KB). These zips have to keep existing until the work is finished.
- Once unpacked: ~3.382 GB. Because a work item a) might get rerun and b) might munge its own directory, we keep two copies of this and re-copy from the "unzip" folder to the "exec" folder every time.
- Correlation payload zip goes from 248 MB zipped to 848 MB unpacked.
That means just having this work item unpacked eats 3.382 + 3.382 + 0.848 + 0.963 GB = 8.575 GB for the work item alone, before counting logs, dumps, etc. Things we should pursue:
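As an editorial aside, here is a minimal sketch of the disk-footprint arithmetic above, using only the sizes quoted in that comment (the variable names are illustrative and are not part of any Helix tooling):

```python
# Rough disk footprint of one JIT.jit64.mcc work item on a Helix machine,
# using the sizes quoted above (all values in GB). Illustrative only.
workitem_zip = 0.963          # ~941 MB payload zip, kept until the work item finishes
workitem_unpacked = 3.382     # payload once unpacked
correlation_unpacked = 0.848  # correlation payload after unpacking

# The unpacked payload exists twice: once in the "unzip" folder and once in the
# "exec" folder, because a work item may be rerun and may munge its own directory.
total_gb = workitem_zip + 2 * workitem_unpacked + correlation_unpacked
print(f"~{total_gb:.3f} GB before logs, dumps, etc.")  # ~8.575 GB
```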
I believe that @echesakovMSFT was working on partitioning CoreCLR tests into chunks. I remember from my .NET Native test migration to Helix that we ended up with vastly different characteristics of Intel vs. ARM work items in terms of size. During our chat in Redmond, @jashook mentioned that the current design is very inflexible in terms of adjustable work item sizes. If this turns out to be a crucial factor for ARM testing, we might want to rethink some of the infra logic with new goals in mind, like clean Mono support or tagging tests for OS independence.
@trylek If we need to solve this issue now, we can specify a finer partitioning of the JIT.jit64.mcc work item in src\coreclr\tests\testgrouping.proj. Since it's an MSBuild file, you can also put conditions on the groups, and it's doable to have a separate partitioning scheme for each configuration. @MattGal By the way, if I remember right, the work item size is why JIT/jit64 was split into multiple work items in the first place. All the test artifacts in the jit64 directory on Windows take roughly 6 GB, and when we were bringing up Helix testing in coreclr this was too much even for x64.
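As a purely illustrative aside (the real partitioning lives in the MSBuild testgrouping.proj file referenced above, and nothing below reflects its actual schema), here is a small Python sketch of the idea of slicing one large test group into several smaller work items:

```python
# Illustrative only: split a flat list of tests into smaller, roughly equal
# groups so each Helix work item ships (and unpacks) a smaller payload.
# The actual CoreCLR grouping is expressed in MSBuild, not Python.
from typing import Iterable


def partition(tests: list, group_count: int) -> Iterable[list]:
    """Yield `group_count` roughly equal-sized groups of tests."""
    size, remainder = divmod(len(tests), group_count)
    start = 0
    for i in range(group_count):
        end = start + size + (1 if i < remainder else 0)
        yield tests[start:end]
        start = end


# e.g. splitting 56 hypothetical mcc tests into 4 work items of 14 each
groups = list(partition([f"mcc_test_{i}" for i in range(56)], 4))
assert [len(g) for g in groups] == [14, 14, 14, 14]
```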
@echesakovMSFT thanks for clarifying. Once we fix up the machines they should have lots more space, so you shouldn't have to change anything; but unless it results in more duplication of content across work item payloads, more, smaller work items will generally make it through Helix faster.
I've fixed up these machines so they have their work directory on the 60 GB free disk, so you now have 50 more GB to play with in the work directory. Do note that with payloads this big, downloading and unzipping them is going to be a non-trivial part of their execution; there's not much we can do about that. @trylek can you kick off a fresh run?
Thanks @MattGal. Closed & reopened the PR; the results are kind of weird - the summary in the PR indicates that the Windows legs are still running, but in Azure it shows they failed. For the Windows ARM run, if I read the log correctly, it claims that it lost connection to the pool.
CoreCLR Pri0 Test Run Windows_NT arm checked:
##[error]We stopped hearing from agent NetCorePublic-Pool 8. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Pool: NetCorePublic-Pool
Agent: NetCorePublic-Pool 8
Started: Yesterday at 10:16 PM
Duration: 1h 31m 13s
Job preparation parameters: 7 queue time variables used
This is super weird. For now, let's just retry them, and if it comes back I can investigate.
@MattGal - is there any way to double-check whether this is a one-off issue or a problem with a particular machine? This is the first occurrence in about 5 days, so I'm not that scared yet, but if this starts reproing on a more regular basis, I'll be strongly pushed to disable the ARM runs again. Thanks a lot!
Yes, actually; it's not terribly hard to use Kusto queries to see whether a particular machine is an outlier for your work, given enough work items. I'll take a peek. My understanding here is that most of the "fix" was basically a refactoring of payloads so they are no longer 700+ MB per work item; if that regressed on your side it could be relevant.
The work item Viktor linked failed because downloading and unpacking its payload filled the disk... so the "good" part here is that network speed isn't the problem (i.e. in all cases, as far as I can see, you were able to download the work, just not unpack it).
Looking at the zip file for the work item, it's still 694 MB zipped. @jashook was working on reducing this; are these runs perhaps missing his changes? Querying general failures like this in the past week, there's no trend of any specific machine hitting this more often than others. Rather, your single work item's payload (ignoring all correlation payloads) is still well over 3 GB (unpacking the one above shows it as 3.32 GB on my local computer). As we discussed before, since your "single" work item payloads are just lots of tests, the simplest and best fix is to split them up. I see something like 56 distinct tests (split across 798 DLLs) in this same work item. If you can figure out how to send that as two bursts of 28, your payload size will drop by approximately half. If you can send it as four bursts of 14, it will drop by 75%. If you make each test a distinct work item, payload size drops by a whopping 56-fold, and that maximizes usage of the available machines.
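To make those numbers concrete, here is a quick back-of-the-envelope calculation (the 3.32 GB payload and 56-test count come from the comment above; this is plain arithmetic, not any Helix API):

```python
# How per-work-item payload size shrinks as the 56 tests are split across more
# work items, assuming the tests contribute roughly equally to the 3.32 GB payload.
total_payload_gb = 3.32
total_tests = 56

for work_items in (1, 2, 4, total_tests):
    per_item_gb = total_payload_gb / work_items
    reduction = 1 - 1 / work_items
    print(f"{work_items:2d} work items -> ~{per_item_gb:.2f} GB each "
          f"({reduction:.0%} smaller than today)")
```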
Thanks @MattGal for following up so quickly, that really sounds promising. Jared started looking into our artifact sizes in general, as some of them seem ridiculously big; I guess that, once we're done with an initial inventory of which artifacts we really need and which can be thrown away, we should reassess this and discuss the work item partitioning policy.
Chatting with Jarrett, he also reminded me of what I did earlier in this thread; something put the variable back to C:\ here. I can resurrect the sneaky trick I did to undo this while reaching out to DDFUN to understand why it may have regressed. Will update this thread once done.
I met with Jarett and he reminded me we'd already done this for other machines, just evidently not for this queue. I've updated the machines again and discussed it with DDFUN, so you should be unblocked. (Edit: Evidently some machines from the queue got re-imaged with old scripts, and the manual fixup steps were not followed; this is the fallout.)
Awesome, thank you!
Presumably this isn't happening anymore. Closing. Feel free to reopen.
After I enabled Windows arm32 runs using the new Galaxy Book laptop queue (Windows.10.Arm64v8.Open), we’re starting to monitor the first errors on that queue 😊. We now see a weirdly systematic error in the “JIT.jit64.mcc” work item, for instance in this run:
https://dev.azure.com/dnceng/public/_build/results?buildId=521323&view=logs&jobId=6c46bee0-e095-5eff-8d48-d352951d0d7b
It has two different manifestations: either the Helix log is not available at all (in the quoted run this is the case for the “no_tiered_compilation” flavor of the Windows arm32 job), or it’s present (like in the other Windows arm32 job in the same run) and complains about the missing XUnit wrapper for the test:
There are about 20 xUnit wrappers getting generated in the Pri0 runs and all the others apparently succeeded; I also see in the step “Copy native components to test output folder” of the job “CoreCLR Pri0 Test Run Windows_NT arm checked” that the JIT.jit64.XUnitWrapper.dll is generated fine, just like all the other wrappers. @MattGal, is there any magic you might be able to pull off to help us better understand what’s going on, whether it’s a reliability issue of the newly brought up machines or perhaps of a particular machine, and/or how to investigate that further?