Out of disk space building on Docker NanoServer #34780
Hi guys. First, I'm impressed you filled the whole disk in only 24 minutes. Also, there's nothing about the 2nd job that says the disk is full, just a 2-hour timeout? Anyway, I've had conversations with the MMS folks recently about similar issues in the macOS hosted pools, and I learned that the amount of free space goes up and down with the payloads installed; no one actually knows what it is, so they can't tell you how to keep your build from blowing up the machine. Your options include:
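Whatever those options were, one generic mitigation is to fail the job fast with a clear message when free space is already low, rather than filling the disk and hitting the 2-hour timeout. A minimal Python sketch of such a pre-build check (the 20 GB threshold and the checked path are illustrative assumptions, not values from this thread):

```python
# Hypothetical pre-build step: fail fast when free space is already low,
# instead of filling the disk mid-build and hitting the job timeout.
# The 20 GB threshold and the checked path are illustrative assumptions.
import shutil
import sys

MIN_FREE_GB = 20      # assumed safety margin for build output + Docker images
CHECK_PATH = "/"      # on a Windows agent this would be a drive root like "C:\\"

free_gb = shutil.disk_usage(CHECK_PATH).free / 1024**3
if free_gb < MIN_FREE_GB:
    # "##[error]" is the Azure Pipelines formatting prefix that highlights the line.
    print(f"##[error]Only {free_gb:.1f} GB free on {CHECK_PATH}; "
          f"need at least {MIN_FREE_GB} GB.")
    sys.exit(1)

print(f"Disk check passed: {free_gb:.1f} GB free on {CHECK_PATH}.")
```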
@dotnet/ncl Can someone look at this? It's failing every run.
I think we should just move these builds out of the Hosted pools as they use considerable disk space. cc: @eiriktsarpalis
Agree. This is the only way we're going to get the desired reliability here.
@alnikola can you please help us here?
FWIW this is happening when building the clr and libraries on the host machine.
Are the machines supposed to handle the build? Do we use the right design for this workflow?
The failures seem to have started abruptly 10 days ago, which suggests that one change may have broken it.
On Tue, 14 Apr 2020, 22:01 Karel Zikmund wrote:
Are the machines supposed to handle the build?
Do we use the right design for this workflow?
Actually, the build was failing with an unknown switch on Linux, and the build was not marked as failed: https://dev.azure.com/dnceng/public/_build/results?buildId=587364. What changed the way we build this was 42183b1#diff-41f10863d38cf298ee01c22c64e1b53a. We normally don't use Hosted machines for our builds because of their limited disk space; our build uses around 8 GB for the artifacts alone. This build also produces Docker containers, which take up additional space.
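To see where that space actually goes on a given agent, a small diagnostic like the following can report the artifacts footprint alongside Docker's own disk accounting. This is a rough sketch, not part of the pipeline; the `artifacts` directory name is assumed from the comment above, and `docker system df` is the standard Docker command for this:

```python
# Rough diagnostic for where the space goes on an agent: size of the build
# artifacts plus Docker's own accounting of image/container/volume usage.
# The "artifacts" directory name is assumed from the comment above.
import os
import shutil
import subprocess

def dir_size_gb(path: str) -> float:
    """Total the file sizes under a directory tree, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files can come and go while a build is running
    return total / 1024**3

print(f"Free space on this volume: {shutil.disk_usage('.').free / 1024**3:.1f} GB")

if os.path.isdir("artifacts"):
    print(f"artifacts/ currently holds {dir_size_gb('artifacts'):.1f} GB")

# `docker system df` summarizes space used by images, containers, and volumes.
try:
    subprocess.run(["docker", "system", "df"], check=False)
except FileNotFoundError:
    print("docker CLI not found on this agent")
```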
FWIW, the stress pipeline doesn't run automatically when PRs happen. So, any changes to overall build scripts (like removing the …
Apparently, I misunderstood @safern's reply. The wrong-argument issue has since been fixed, so we definitely have to move the stress test build to a different agent pool.
HttpStress and SslStress tests have been moved off the hosted pool to different queues. Note: HttpStress runs are still failing, but that is an actual test-code or product-code issue which will be investigated separately. Infrastructure-wise, everything looks good now. Fixes #34780
The stress-http and stress-ssl jobs are failing to build because they run out of disk space, e.g.:
https://dev.azure.com/dnceng/public/_build/results?buildId=592971&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=abae1f68-3c73-5bff-491f-f2b908580ce6
https://dev.azure.com/dnceng/public/_build/results?buildId=592970&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=abae1f68-3c73-5bff-491f-f2b908580ce6
@dotnet/runtime-infrastructure