JIT.jit64.mcc fails on newly enabled Windows arm32 queue Windows.10.Arm64v8.Open #32320
Comments
P.S. I tried to check whether the failures are confined to a single machine, but I found at least two: DDARM64-056 and DDARM64-110.
[Table of affected builds, configurations, and Helix logs]
This is affecting a large fraction of the PRs. I am trying to disable these tests in #32372 until this is fixed.
Disabling the mcc tests moved the failure to the next work item, which means the failure is not specific to the mcc tests; they just happen to be a victim due to ordering. @trylek Can we disable the ARM runs until this is fixed?
Submitted a PR to disable coreclr's test execution on ARM: #32404
Tomas already disabled them: d9bc547
FWIW, one thing occurred to me during my chat with Viktor yesterday: when I was standing up the queue of Galaxy Book laptops for .NET Native testing about 1 1/2 years ago, I was hitting weird reliability issues that I later found out to be caused by the fact that the Windows installation on these laptops was continually spewing internal crash dumps onto the relatively small HDD, which was soon overflowing. I ended up talking to some Watson folks who recommended setting a magic environment variable, which ultimately fixed that. I'm not saying this is necessarily the cause here, but I can easily imagine that some of the weird symptoms, like the lack of correlation with a particular workload or the non-deterministic absence of logs, could be explained by a lack of disk space.
Adding a link to the related older item for reference: #1097
Sorry for joining the party late, I am taking a look now.
I spent some time pondering JIT.jit64.mcc and the logs. It's clear that the problem is that we never really expected 3+ GB work item payloads, but we can make it work. Note that if it's slow to unzip on my computer, it's slow to unzip on the Helix laptops. Sample log.
- Work item payload zip: ~941 MB zipped (963,810 KB). These zips have to keep existing until the work is finished.
- Once unpacked: ~3.382 GB. Because a work item a) might get rerun and b) might munge its own directory, we keep two copies of this and re-copy from the "unzip" folder to the "exec" folder every time.
- Correlation payload zip goes from 248 MB zipped to 848 MB unpacked.
That means just having this work item unpacked eats 3.382 + 3.382 + 0.848 + 0.963 GB = 8.575 GB for the work item alone, before counting logs, dumps, etc. Things we should pursue:
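As an editorial aside, here is a minimal sketch of the disk-footprint arithmetic above, using only the sizes quoted in that comment (the variable names are illustrative and are not part of any Helix tooling):

```python
# Rough disk footprint of one JIT.jit64.mcc work item on a Helix machine,
# using the sizes quoted above (all values in GB). Illustrative only.
workitem_zip = 0.963          # ~941 MB payload zip, kept until the work item finishes
workitem_unpacked = 3.382     # payload once unpacked
correlation_unpacked = 0.848  # correlation payload after unpacking

# The unpacked payload exists twice: once in the "unzip" folder and once in the
# "exec" folder, because a work item may be rerun and may munge its own directory.
total_gb = workitem_zip + 2 * workitem_unpacked + correlation_unpacked
print(f"~{total_gb:.3f} GB before logs, dumps, etc.")  # ~8.575 GB
```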
I believe that @echesakovMSFT was working on partitioning CoreCLR tests into chunks. I remember from my .NET Native test migration to Helix that we ended up with vastly different characteristics of Intel vs. ARM work items in terms of size. During our chat in Redmond, @jashook mentioned that the current design is very inflexible in terms of adjustable work item sizes. If this turns out to be a crucial factor for ARM testing, we might want to rethink some of the infra logic with new goals in mind, like clean Mono support or tagging tests for OS independence.
@trylek If we need to solve this issue now, we can specify a finer partitioning of the JIT.jit64.mcc work item in src\coreclr\tests\testgrouping.proj. Since it's an MSBuild file, you can also put conditions on the groups, and it's doable to have a separate partitioning scheme for each configuration. @MattGal By the way, if I remember right, the work item size is why JIT/jit64 was split into multiple work items in the first place. All the test artifacts in the jit64 directory on Windows take roughly 6 GB, and when we were bringing up Helix testing in coreclr this was too much even for x64.
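As a purely illustrative aside (the real partitioning lives in the MSBuild testgrouping.proj file referenced above, and nothing below reflects its actual schema), here is a small Python sketch of the idea of slicing one large test group into several smaller work items:

```python
# Illustrative only: split a flat list of tests into smaller, roughly equal
# groups so each Helix work item ships (and unpacks) a smaller payload.
# The actual CoreCLR grouping is expressed in MSBuild, not Python.
from typing import Iterable


def partition(tests: list, group_count: int) -> Iterable[list]:
    """Yield `group_count` roughly equal-sized groups of tests."""
    size, remainder = divmod(len(tests), group_count)
    start = 0
    for i in range(group_count):
        end = start + size + (1 if i < remainder else 0)
        yield tests[start:end]
        start = end


# e.g. splitting 56 hypothetical mcc tests into 4 work items of 14 each
groups = list(partition([f"mcc_test_{i}" for i in range(56)], 4))
assert [len(g) for g in groups] == [14, 14, 14, 14]
```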
@echesakovMSFT thanks for clarifying. Once we fix up the machines they should have lots more space, so you shouldn't have to change anything; but unless it results in more duplication of content across work item payloads, more, smaller work items will generally make it through Helix faster.
I've fixed up these machines so they have their work directory on the 60 GB free disk, so you now have 50 more GB to play with in the work directory. Do note that with payloads this big, downloading and unzipping them is going to be a non-trivial part of their execution; there's not much we can do about that. @trylek can you kick off a fresh run?
Thanks @MattGal. Closed & reopened the PR; the results are kind of weird - the summary in the PR indicates that the Windows legs are still running, but in Azure it shows they failed. For the Windows ARM run, if I read the log correctly, it claims that it lost connection to the pool.
CoreCLR Pri0 Test Run Windows_NT arm checked:
##[error]We stopped hearing from agent NetCorePublic-Pool 8. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
Pool: NetCorePublic-Pool
Agent: NetCorePublic-Pool 8
Started: Yesterday at 10:16 PM
Duration: 1h 31m 13s
Job preparation parameters: 7 queue time variables used
This is super weird. For now, let's just retry them, and if it comes back I can investigate.
@MattGal - is there any way to double-check whether this is a one-off issue or a problem with a particular machine? This is the first occurrence in about 5 days, so I'm not that scared yet, but if this starts reproing on a more regular basis, I'll be strongly pushed to disable the ARM runs again. Thanks a lot!
Yes, actually; it's not terribly hard to use Kusto queries to see whether a particular machine is an outlier for your work, given enough work items. I'll take a peek. My understanding here is that most of the "fix" was basically a refactoring of payloads so they are no longer 700+ MB per work item; if that regressed on your side it could be relevant.
The work item Viktor linked failed because downloading and unpacking its payload filled the disk... so the "good" part here is that network speed isn't the problem (i.e. in all cases, as far as I can see, you were able to download the work, just not unpack it).
Looking at the zip file for the work item, it's still 694 MB zipped. @jashook was working on reducing this; are these runs perhaps missing his changes? Querying general failures like this in the past week, there's no trend of any specific machine hitting this more often than others. Rather, your single work item's payload (ignoring all correlation payloads) is still well over 3 GB (unpacking the one above shows it as 3.32 GB on my local computer). As we discussed before, since your "single" work item payloads are just lots of tests, the simplest and best fix is to split them up. I see something like 56 distinct tests (split across 798 DLLs) in this same work item. If you can figure out how to send that as two bursts of 28, your payload size will drop by approximately half. If you can send it as four bursts of 14, it will drop by 75%. If you make each test a distinct work item, payload size drops by a whopping 56-fold, and that maximizes usage of the available machines.
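To make those numbers concrete, here is a quick back-of-the-envelope calculation (the 3.32 GB payload and 56-test count come from the comment above; this is plain arithmetic, not any Helix API):

```python
# How per-work-item payload size shrinks as the 56 tests are split across more
# work items, assuming the tests contribute roughly equally to the 3.32 GB payload.
total_payload_gb = 3.32
total_tests = 56

for work_items in (1, 2, 4, total_tests):
    per_item_gb = total_payload_gb / work_items
    reduction = 1 - 1 / work_items
    print(f"{work_items:2d} work items -> ~{per_item_gb:.2f} GB each "
          f"({reduction:.0%} smaller than today)")
```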
Thanks @MattGal for following up so quickly, that really sounds promising. Jared started looking into our artifact sizes in general, as some of them seem ridiculously big; I guess that, once we're done with an initial inventory of which artifacts we really need and which can be thrown away, we should reassess this and discuss the work item partitioning policy.
Chatting with Jarrett, he also reminded me of what I did earlier in this thread; something put the variable back to C:\ here. I can resurrect the sneaky trick I did to undo this while reaching out to DDFUN to understand why it may have regressed. Will update this thread once done.
I met with Jarett and he reminded me we'd already done this for other machines, just evidently not for this queue. I've updated the machines again and discussed it with DDFUN, so you should be unblocked. (Edit: Evidently some machines from the queue got re-imaged with old scripts, and the manual fixup steps were not followed; this is the fallout.)
Awesome, thank you!
Presumably this isn't happening anymore. Closing. Feel free to reopen.
After I enabled Windows arm32 runs using the new Galaxy Book laptop queue (Windows.10.Arm64v8.Open), we’re starting to monitor the first errors on that queue 😊. We now see a weirdly systematic error in the “JIT.jit64.mcc” work item, for instance in this run:
https://dev.azure.com/dnceng/public/_build/results?buildId=521323&view=logs&jobId=6c46bee0-e095-5eff-8d48-d352951d0d7b
It has two different manifestations: either the Helix log is not available at all (in the quoted run this is the case for the “no_tiered_compilation” flavor of the Windows arm32 job), or it’s present (like in the other Windows arm32 job in the same run) and complains about the missing XUnit wrapper for the test:
There are about 20 xUnit wrappers getting generated in the Pri0 runs and all the others apparently succeeded; I also see in the step “Copy native components to test output folder” of the job “CoreCLR Pri0 Test Run Windows_NT arm checked” that the JIT.jit64.XUnitWrapper.dll is generated fine, just like all the other wrappers. @MattGal, is there any magic you might be able to pull off to help us better understand what’s going on, whether it’s a reliability issue of the newly brought up machines or perhaps of a particular machine, and/or how to investigate that further?