-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
For a long time, CI runs have been sporadically failing when agents stop responding. (It usually goes away after a rerun, but sometimes it takes several tries.) The error messages are of the form:
##[error]We stopped hearing from agent BUILD000382. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
We aren't absolutely sure about the root cause, but we suspect that it's CPU starvation during our intensely multithreaded test runs. We've tried to mitigate this by using only 30 out of 32 cores:
| $testParallelism = $env:NUMBER_OF_PROCESSORS - 2 |
However, one of our tests builds with /MP, so it'll consume all available cores (I haven't checked whether agents that stop responding are highly correlated with machines that run a configuration of this test):
| exportHeaderOptions = ['/exportHeader', '/Fo', '/MP'] |
It's unclear what we should do to fix this, but it's a recurring productivity drain. Some ideas:
- Contact the Azure Pipelines team and investigate whether agents can run at High priority.
- Investigate using our "jobify" machinery so we can run the compiler and test executables at Low priority.