Frequent sporadic CI failures due to agents not responding #1617

For a long time, CI runs have been sporadically failing when agents stop responding. (A rerun usually clears the failure, but sometimes it takes several tries.) The error messages are of the form:

> ##[error]We stopped hearing from agent BUILD000382. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

We aren't absolutely sure about the root cause, but we suspect that it's CPU starvation during our intensely multithreaded test runs. We've tried to mitigate this by using only 30 out of 32 cores:

https://github.com/microsoft/STL/blob/c7b14059332afcc512ba245390bb0acf5bbbb0ed/azure-devops/cmake-configure-build.yml#L27
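For illustration, the "leave a couple of cores idle" mitigation boils down to the arithmetic below. This is only a sketch: `parallel_jobs` and its parameters are hypothetical names, not something taken from our pipeline YAML.

```python
import os


def parallel_jobs(reserved_cores=2, total_cores=None):
    """Return how many jobs to run in parallel while leaving some cores idle.

    Hypothetical helper: the names and the default of 2 reserved cores are
    assumptions chosen to mirror the "30 out of 32" setting described above.
    """
    if total_cores is None:
        # os.cpu_count() can return None when the count is unknown.
        total_cores = os.cpu_count() or 1
    # Always run at least one job, even on very small machines.
    return max(1, total_cores - reserved_cores)
```

On a 32-core agent, `parallel_jobs(2, 32)` gives 30, which leaves 2 cores for the agent process itself.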

However, one of our tests builds with `/MP`, so it'll consume all available cores. I haven't checked whether unresponsive agents are highly correlated with the machines that run a configuration of this test. The `/MP` usage is here:

https://github.com/microsoft/STL/blob/c7b14059332afcc512ba245390bb0acf5bbbb0ed/tests/std/tests/P1502R1_standard_library_header_units/custom_format.py#L104
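One untested possibility is to pass an explicit process limit, since MSVC's `/MP` accepts an optional maximum (for example `/MP30`). The sketch below only shows how such a flag could be formed; `mp_flag` is a hypothetical helper and is not part of `custom_format.py`. Because this test runs alongside other tests, a cap here would not by itself guarantee idle cores.

```python
def mp_flag(max_processes=None):
    """Build an MSVC /MP option, optionally capped at max_processes.

    Hypothetical helper for illustration. A bare "/MP" lets the compiler
    use every available processor; "/MP<n>" limits it to n processes.
    """
    if max_processes is None:
        return "/MP"
    if max_processes < 1:
        raise ValueError("max_processes must be at least 1")
    return f"/MP{max_processes}"
```

For example, `mp_flag(30)` produces `/MP30`, while `mp_flag()` keeps the current uncapped behavior.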

It's unclear what we should do to fix this, but it's a recurring productivity drain. Some ideas:

* Contact the Azure Pipelines team and investigate whether agents can run at High priority.
* Investigate using our "jobify" machinery so we can run the compiler and test executables at Low priority.
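As a rough illustration of the second idea, a child process can be launched at reduced priority directly from Python. This is a generic sketch and not our "jobify" machinery; `run_low_priority` and `low_priority_popen_kwargs` are hypothetical names, and the choice of `IDLE_PRIORITY_CLASS` (shown as "Low" in Task Manager) is an assumption.

```python
import os
import subprocess
import sys


def low_priority_popen_kwargs():
    """Return subprocess keyword arguments that lower the child's priority."""
    if sys.platform == "win32":
        # Assumption: IDLE_PRIORITY_CLASS is the level we would want;
        # BELOW_NORMAL_PRIORITY_CLASS is a milder alternative.
        return {"creationflags": subprocess.IDLE_PRIORITY_CLASS}
    # On POSIX, raise the niceness in the child just before it execs.
    return {"preexec_fn": lambda: os.nice(10)}


def run_low_priority(args):
    """Run a command at reduced priority and capture its output."""
    return subprocess.run(
        args,
        capture_output=True,
        text=True,
        check=False,
        **low_priority_popen_kwargs(),
    )
```

A compiler or test executable launched this way should yield CPU time to the agent process under contention, although whether that is enough to stop the agent from going silent is exactly what would need to be investigated.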
