Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

travis: Parallelize tests on Android #41575

Merged
merged 1 commit into from
Apr 28, 2017

Conversation

alexcrichton
Copy link
Member

Currently our slowest test suite on android, run-pass, takes over 5 times longer
than the x86_64 component (~400 -> ~2200s). Typically QEMU emulation does indeed
add overhead, but not 5x for this kind of workload. One of the slowest parts of
the Android process is that compilation happens serially. Tests themselves
need to run single-threaded on the emulator (due to how the test harness works)
and this forces the compiles themselves to be single threaded.

Now Travis gives us more than one core per machine, so it'd be much better if we
could take advantage of them! The emulator itself is still fundamentally
single-threaded, but we should see a nice speedup by sending binaries for it to
run much more quickly.

It turns out that we've already got all the toos to do this in-tree. The
qemu-test-{server,client} that are in use for the ARM Linux testing are a
perfect match for the Android emulator. This commit migrates the custom adb
management code in compiletest/rustbuild to the same qemu-test-{server,client}
implementation that ARM Linux uses.

This allows us to lift the parallelism restriction on the compiletest test
suites, namely run-pass. Consequently although we'll still basically run the
tests themselves in single threaded mode we'll be able to compile all of them in
parallel, keeping the pipeline much more full hopefully and using more cores for
the work at hand. Additionally the architecture here should be a bit speedier as
it should have less overhead than adb which is a whole new process on both the
host and the emulator!

Locally on an 8 core machine I've seen the run-pass test suite speed up from
taking nearly an hour to only taking 5 minutes. I don't think we'll see quite a
drastic speedup on Travis but I'm hoping this change can place the Android tests
well below 2 hours instead of just above 2 hours.

Because the client/server here are now repurposed for more than just QEMU,
they've been renamed to remote-test-{server,client}.

Note that this PR does not currently modify how debuginfo tests are executed on
Android. While parallelizable it wouldn't be quite as easy, so that's left to
another day. Thankfull that test suite is much smaller than the run-pass test
suite.

@rust-highfive
Copy link
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

@aturon
Copy link
Member

aturon commented Apr 27, 2017

cc @TimNN @aidanhs, may be of interest to you fine folks :)

@@ -15,7 +15,7 @@
#![feature(test)]
#![feature(libc)]

#![deny(warnings)]
// #![deny(warnings)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Looks like this need to be uncommented

@alexcrichton
Copy link
Member Author

Locally a full build of the arm-android image took 1:47:24 before this change, and 0:39:56 afterwards. That's over an hour's time shaved off!

@@ -1371,6 +1371,7 @@ mod tests {
// `set` command.
assert!((cfg!(windows) && k.starts_with("=")) ||
k.starts_with("DYLD") ||
k == "PWD" ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this PR mean PWD needs to be excluded?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't actually know why this was needed until just now! I think what's happening is:

  1. We're shipping over all environment variables from the host to the target. One of these is PWD. This means that when this test is executed, then PWD is the same as it is on the host, but it doesn't match the target.
  2. When the test runs, it executes env
  3. On android, this looks like it uses a shell, which probably resets PWD
  4. This means that the env command on Android is printing out a different value of PWD than the parent process has (b/c we reset it manually).

Now that I actually know what's happening I'll remove this change and alter the "ship env vars to the client" code to skip PWD (it already skips a few others)

@aidanhs
Copy link
Member

aidanhs commented Apr 27, 2017

This is pretty awesome. Also, this made me laugh:

As a final fix I discovered that the ARM and Android test suites were actually running all library unit tests (e.g. stdtest, coretest, etc) twice. I've corrected that to only run tests once which should also give a nice boost in overall cycle time here.

This is the removal of find_tests I assume (which I note has also been done for emscripten). Any particular reason you opted to keep the deps subdirectory rather than out_dir itself? The latter feels slightly more intuitive to me, but I don't know if it might miss some tests.


// android debug-info test uses remote debugger
// so, we test 1 thread at once.
// also trying to isolate problems with adb_run_wrapper.sh ilooping
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adb_run_wrapper.sh doesn't seem to exist any more, so at least this line can be removed

// hopefully allow us to still identify what's running while staying under
// the filesystem limits.
let len = cmp::min(filename.len() - 1, 50);
let dst = dir.join(t!(str::from_utf8(&filename[..len])));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be being a bit dim, but why - 1? This seems like it will always truncate the last character if the length is below 50? (e.g. a filename of length 1 becomes length 0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that has to do with the weird protocol of the remote-test-{server,client}. The filename coming in has a trailing nul byte which we want to chop off.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a funny story about this logic as well. I was getting a random failure in a test with a super long test name and I couldn't figure out why. If I ran only that one test it passed, but in the entire suite it failed. Turned out:

  • Running the test alone the path was 126 characters
  • Running the test after 10 other tests the path was 127 characters
  • Running the test after 100 other tests the pth was 128 characters

The test only ended up failing if it was run after 100 other tests. Sure enough each test claims itself a number and is run in a testXX directory, so after 100 tests we're in 3 character range, pushing us from 127 to 128.

Turns out Android has trouble with paths longer than 127 characters. Who knew!

All of this was because we used to run tests in /data/tmp, and I changed it in this PR to /data/tmp/work. Wheeeeeeeeeeeeeeee

Copy link
Member

@aidanhs aidanhs Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's horrifying. It's the kind of thing I could imagine lingering for months as a spurious failure if tests were run in parallel and a test flipped between number 99 and 100 to run 😞

@alexcrichton
Copy link
Member Author

Any particular reason you opted to keep the deps subdirectory rather than out_dir itself? The latter feels slightly more intuitive to me, but I don't know if it might miss some tests.

Oh the deps dir I think is guaranteed to have everything, whereas the out_dir may or may not have everything. The location of all Cargo's artifacts is unknown to even me over time :). I suspect that either works for us, it's just that right now they're duplicated picking it up from both locations.

if !SILENCE_PANIC.with(|s| s.get()) {
prev(info);
}
}));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the changes to this test required for the new testing method, or is it just an unrelated improvement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was primarily just an optimization. Without this the test generated over a megabyte of output and I saw the run-pass test suite noticeably hang at the end with the collector trying to collect all that data. This just reduces the output quite a bit so the test is a bit more speedy to run.

@alexcrichton
Copy link
Member Author

Ok updated with recent comments

@alexcrichton alexcrichton added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 27, 2017
@aidanhs
Copy link
Member

aidanhs commented Apr 27, 2017

Nothing else from me (though my look at the remote test server was superficial since I first saw it today).

Keen for the speedups!

Copy link
Contributor

@TimNN TimNN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me.

t!(fs::create_dir_all(work));

let lock = Arc::new(Mutex::new(()));

for socket in listener.incoming() {
let mut socket = t!(socket);
let mut buf = [0; 4];
t!(socket.read_exact(&mut buf));
if socket.read_exact(&mut buf).is_err() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring the error here seems fine to me, although I'm curious why this has become necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found that the way emulators treat network connections is super flaky. The server was crashing on startup on Android for a very weird reason, which I was later able to debug fully. The behavior I saw was:

  • The client would need to push 4 dylibs initially to the server
  • All 4 pushes finished immediately
  • Oddly enough, only the last one would actually get pushed

Everything was "fixed" when I forced the client to keep the TCP connection alive until the server sent an "ack" back acknowledging it received the whole file. I have no idea why the data wasn't actually sent from the client to the server when it was all reported as successfully being sent on the client.

Given all that behavior what I think was happening was:

  • The client created a TCP connection to the server (this goes through the emulation layer)
  • The client successfully sent all data
  • The client closed the TCP connection
  • The server accepted the connection
  • Somehow the emulator decided to not send all the data on the TCP connection and reported EOF

That EOF in turn previously caused a panic here. It ended up being much easier to debug if I ignored errors here and the server got farther. In general this seemed like a good idea because of how flaky the emulated networks seemed to be...

t!(client.flush());

// Wait for an acknowledgement that all the data was received. No idea
// why this is necessary, seems like it shouldn't be!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify: without the server sending and us waiting on the "ack " the data is not properly received?

My best guess is that the immediate close of the socket after sending the data could cause problems, which seems to indeed be the case see: http://blog.csdn.net/CPP_CHEN/article/details/29864509

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, would it make sense to check if the bytes read were actually "ack "?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I ended up describing this in a comment I just wrote, it's related to the "ignore errors on accept sorta" path as well.

Although I guess this is a facet of TCP and not of the emulated network! I'll check here that ack was sent though to be extra sure.

t!(client.write_all(&[0]));
t!(client.write_all(v.as_bytes()));
t!(client.write_all(&[0]));
match k {
Copy link
Contributor

@TimNN TimNN Apr 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an .as_ref() or something, otherwise it causes a type mismatch (String vs &str)

Currently our slowest test suite on android, run-pass, takes over 5 times longer
than the x86_64 component (~400 -> ~2200s). Typically QEMU emulation does indeed
add overhead, but not 5x for this kind of workload. One of the slowest parts of
the Android process is that *compilation* happens serially. Tests themselves
need to run single-threaded on the emulator (due to how the test harness works)
and this forces the compiles themselves to be single threaded.

Now Travis gives us more than one core per machine, so it'd be much better if we
could take advantage of them! The emulator itself is still fundamentally
single-threaded, but we should see a nice speedup by sending binaries for it to
run much more quickly.

It turns out that we've already got all the tools to do this in-tree. The
qemu-test-{server,client} that are in use for the ARM Linux testing are a
perfect match for the Android emulator. This commit migrates the custom adb
management code in compiletest/rustbuild to the same qemu-test-{server,client}
implementation that ARM Linux uses.

This allows us to lift the parallelism restriction on the compiletest test
suites, namely run-pass. Consequently although we'll still basically run the
tests themselves in single threaded mode we'll be able to compile all of them in
parallel, keeping the pipeline much more full and using more cores for the work
at hand. Additionally the architecture here should be a bit speedier as it
should have less overhead than adb which is a whole new process on both the host
and the emulator!

Locally on an 8 core machine I've seen the run-pass test suite speed up from
taking nearly an hour to only taking 6 minutes. I don't think we'll see quite a
drastic speedup on Travis but I'm hoping this change can place the Android tests
well below 2 hours instead of just above 2 hours.

Because the client/server here are now repurposed for more than just QEMU,
they've been renamed to `remote-test-{server,client}`.

Note that this PR does not currently modify how debuginfo tests are executed on
Android. While parallelizable it wouldn't be quite as easy, so that's left to
another day. Thankfully that test suite is much smaller than the run-pass test
suite.

As a final fix I discovered that the ARM and Android test suites were actually
running all library unit tests (e.g. stdtest, coretest, etc) twice. I've
corrected that to only run tests once which should also give a nice boost in
overall cycle time here.
@alexcrichton
Copy link
Member Author

Updated!

@alexcrichton
Copy link
Member Author

@bors: r=TimNN

@bors
Copy link
Contributor

bors commented Apr 28, 2017

📌 Commit 7bc2cbf has been approved by TimNN

@bors
Copy link
Contributor

bors commented Apr 28, 2017

⌛ Testing commit 7bc2cbf with merge 80c3226...

@bors
Copy link
Contributor

bors commented Apr 28, 2017

💔 Test failed - status-appveyor

@alexcrichton
Copy link
Member Author

alexcrichton commented Apr 28, 2017 via email

@bors
Copy link
Contributor

bors commented Apr 28, 2017

⌛ Testing commit 7bc2cbf with merge ad1461e...

bors added a commit that referenced this pull request Apr 28, 2017
travis: Parallelize tests on Android

Currently our slowest test suite on android, run-pass, takes over 5 times longer
than the x86_64 component (~400 -> ~2200s). Typically QEMU emulation does indeed
add overhead, but not 5x for this kind of workload. One of the slowest parts of
the Android process is that *compilation* happens serially. Tests themselves
need to run single-threaded on the emulator (due to how the test harness works)
and this forces the compiles themselves to be single threaded.

Now Travis gives us more than one core per machine, so it'd be much better if we
could take advantage of them! The emulator itself is still fundamentally
single-threaded, but we should see a nice speedup by sending binaries for it to
run much more quickly.

It turns out that we've already got all the toos to do this in-tree. The
qemu-test-{server,client} that are in use for the ARM Linux testing are a
perfect match for the Android emulator. This commit migrates the custom adb
management code in compiletest/rustbuild to the same qemu-test-{server,client}
implementation that ARM Linux uses.

This allows us to lift the parallelism restriction on the compiletest test
suites, namely run-pass. Consequently although we'll still basically run the
tests themselves in single threaded mode we'll be able to compile all of them in
parallel, keeping the pipeline much more full hopefully and using more cores for
the work at hand. Additionally the architecture here should be a bit speedier as
it should have less overhead than adb which is a whole new process on both the
host and the emulator!

Locally on an 8 core machine I've seen the run-pass test suite speed up from
taking nearly an hour to only taking 5 minutes. I don't think we'll see quite a
drastic speedup on Travis but I'm hoping this change can place the Android tests
well below 2 hours instead of just above 2 hours.

Because the client/server here are now repurposed for more than just QEMU,
they've been renamed to `remote-test-{server,client}`.

Note that this PR does not currently modify how debuginfo tests are executed on
Android. While parallelizable it wouldn't be quite as easy, so that's left to
another day. Thankfull that test suite is much smaller than the run-pass test
suite.
@bors
Copy link
Contributor

bors commented Apr 28, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: TimNN
Pushing ad1461e to master...

@bors bors merged commit 7bc2cbf into rust-lang:master Apr 28, 2017
@aidanhs
Copy link
Member

aidanhs commented Apr 28, 2017

Nice, looks like 30mins off the build time - down to 1hr42min from 2hr15min.

@alexcrichton alexcrichton deleted the android-qemu-server branch April 28, 2017 20:02
@alexcrichton
Copy link
Member Author

Now to deal with the 2.5hr test times in OSX ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants