Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Disable most networking tests for default CI runs #11341

Merged
merged 1 commit into from
Sep 1, 2016
Merged

Disable most networking tests for default CI runs #11341

merged 1 commit into from
Sep 1, 2016

Conversation

davidsh
Copy link
Contributor

@davidsh davidsh commented Sep 1, 2016

Lately, the CI system and test server infrastructure has been overloaded with many PRs. This is causing many networking tests to randomly fail due to resource exhaustion caused by so many simultaneous PR runs on the same CI machines. This occurrs not only when using an external test server endpoint (corefx-net.cloudapp.net) but also loopback servers.

To improve engineering efficiency for the team, we are moving most of the tests in the networking libraries from the default CI runs (innerloop). For developers working on networking libraries, the networking tests can be run by invoking the "outerloop" tests.

We will not be allowing any merges of PRs in the networking libraries unless the outerloop tests are also run by the developer.

This affects tests in the following libraries right now:

  • System.Net.Http
  • System.Net.Http.WinHttpHandler
  • System.Net.Sockets
  • System.Net.Security
  • System.Net.WebSockets.Client

I will be working on fine tuning these exclusions (issue dotnet/runtime#18406) to try to bring back tests into the innerloop runs when we have begun to address the infrastructure issues.

@davidsh
Copy link
Contributor Author

davidsh commented Sep 1, 2016

@stephentoub
Copy link
Member

Thanks, David! I'm glad we're taking action on this. I do have a few high-level concerns with the approach, though:

  • This approach means that the networking test code isn't even compiled by default.
  • This approach means we can't run the networking tests in outerloop (unless you plan to also modify the CI configuration so that outerloop runs set this msbuild property)
  • But worst, this approach means that it's unlikely a dev changing networking code/tests will run them on all relevant platforms (e.g. every time you change a networking test, you'd need to get an OSX, Ubuntu, CentOS, etc. machine configured to run it before merging).

Could we just mark all of the networking tests as outerloop instead? The networking test code would still be compiled in all legs. The tests would still run in CI on all relevant OSes, just not by default as part of every PR. And this would enable PRs specific to networking to use the "Test ... please" commands to Jenkins to add those outerloop legs to relevant PRs, such that we're not just reliant on the developer doing it on their own machines (all flavors).

@weshaggard
Copy link
Member

I agree with @stephentoub's advice. That would mean that all these would still run as part of our existing nightly test runs as well, and I really like that fact that anyone can trigger these during a PR if we suspect we need the extra validation.

@davidsh
Copy link
Contributor Author

davidsh commented Sep 1, 2016

This approach means we can't run the networking tests in outerloop (unless you plan to also modify the CI configuration so that outerloop runs set this msbuild property)

I was hoping that the CI system could be modified so that "outerloop" CI test pass also added the "/p:EnableAllNetworkingTests" option.

I was also trying to find a way to avoid touch several hundred tests across dozens of source files.

But it looks like I"ll need to sprinkle these lines in the test files now based on this feedback:

[outerloop] // TODO: Issue #11345

@stephentoub
Copy link
Member

I was also trying to find a way to avoid touch several hundred tests across dozens of source files. But it looks like I"ll need to sprinkle these lines in the test files

You should be able to apply [Outerloop] to the test class rather than to the individual test methods. You'd still need to touch dozens of files, but with one line in each.

Lately, the CI system and test server infrastructure has been overloaded with many PRs. This is causing many networking tests to randomly fail due to resource exhaustion caused by so many simultaneous PR runs on the same CI machines. This occurrs not only when using an external test server endpoint (corefx-net.cloudapp.net) but also loopback servers.

To improve engineering efficiency for the team, we are moving most of the tests in the networking libraries from the default CI runs (innerloop). For developers working on networking libraries, the networking tests can be run by invoking the "outerloop" tests.

We will not be allowing any merges of PRs in the networking libraries unless the outerloop tests are also run by the developer.

This affects tests in the following libraries right now:

* System.Net.Http
* System.Net.Http.WinHttpHandler
* System.Net.Sockets
* System.Net.Security
* System.Net.WebSockets.Client

I will be working on fine tuning these exclusions (issue #11345) to try to bring back tests into the innerloop runs when we have begun to address the infrastructure issues.
@davidsh
Copy link
Contributor Author

davidsh commented Sep 1, 2016

I decided to just touch each test method with [OuterLoop]. That will give the most granularity for the future.

I revised the PR as per feedback.

@stephentoub
Copy link
Member

I decided to just touch each test method with [OuterLoop]

Ok.

Thanks. LGTM.

@stephentoub
Copy link
Member

Test Innerloop CentOS7.1 Debug Build and Test please (known AV in coreclr)

@stephentoub
Copy link
Member

Test Outerloop Windows_NT Debug please
Test Outerloop OSX Debug please
Test Outerloop Ubuntu14.04 Debug please
Test Outerloop CentOS7.1 Debug please

@CIPop
Copy link
Member

CIPop commented Sep 1, 2016

@Petermarcu @weshaggard @stephentoub Is there any way to selectively run CI Outerloop tests based on code that is modified? (If System.Net* code is changed, Outerloop is automatically run for the respective contracts.)

@stephentoub
Copy link
Member

Is there any way to selectively run CI Outerloop tests based on code that is modified

@mmitche would know. I believe at present it needs to be done manually as I did, but he may know of some kind of directory-based triggers that can be put in place. It's a good idea.

@mmitche
Copy link
Member

mmitche commented Sep 1, 2016

There is not. However, the right way to implement this is with the tooling @jaredpar using for Roslyn, which can run tests based on the binary differences.

@jaredpar
Copy link
Member

jaredpar commented Sep 1, 2016

If we can moeve CoreFX to a deterministic build it would be easy to hook it into our suites for only running tests when they actually change. Last I checked the use of al.exe was the only blocker in doing this.

@Petermarcu
Copy link
Member

Definitely something we need to move to. @davidsh , thanks for getting this turned around so quickly.

@davidsh
Copy link
Contributor Author

davidsh commented Sep 1, 2016

Unrelated CI Failure. Going to merge this now.

http://dotnet-ci.cloudapp.net/job/dotnet_corefx/job/master/job/outerloop_centos7.1_debug_prtest/1/

System.IO.Tests.FileSystemWatcherTests.FileSystemWatcher_EnableRaisingEvents (from (empty))

Failing for the past 1 build (Since Failed#1 )
Took 1 ms.
Stacktrace

MESSAGE:
System.IO.IOException : The configured user limit (128) on the number of inotify instances has been reached.
+++++++++++++++++++
STACK TRACE:
at System.IO.FileSystemWatcher.StartRaisingEvents() at System.IO.FileSystemWatcher.StartRaisingEventsIfNotDisposed() at System.IO.FileSystemWatcher.set_EnableRaisingEvents(Boolean value) at System.IO.Tests.FileSystemWatcherTests.FileSystemWatcher_EnableRaisingEvents()

@davidsh davidsh merged commit 0d3ba29 into dotnet:master Sep 1, 2016
@davidsh davidsh deleted the networking_tests_1 branch September 1, 2016 20:59
@stephentoub
Copy link
Member

This is causing many networking tests to randomly fail due to resource exhaustion caused by so many simultaneous PR runs on the same CI machines. This occurrs not only when using an external test server endpoint (corefx-net.cloudapp.net) but also loopback servers.

@davidsh, how did you determine resource exhaustion is the cause here? For the loopback server case, does that mean we're running out of ports or something?

@mmitche
Copy link
Member

mmitche commented Sep 2, 2016

@davidsh We only run a single PR on a single machine at a time (except on mac). How would that affect the loopback server?

@davidsh
Copy link
Contributor Author

davidsh commented Sep 2, 2016

We only have a few theories at this point based on the error messages which are similar between using loopback server and real endpoint. Most of the messages are "timeout" related. There are probably multiple problems at work here including network congestion, etc.

The decision was made by @Petermarcu to move the tests to outerloop to get the engineering team more efficient. A more detailed tracing analysis will need to be done on the CI machines to get deeper understanding into the root cause(s).

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Disable most networking tests for default CI runs

Commit migrated from dotnet/corefx@0d3ba29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants