ci: Collect CPU usage statistics on Azure #61632

Merged: 1 commit, Jun 12, 2019

Conversation

alexcrichton
Member

This commit adds a script which we'll execute on Azure Pipelines which
is intended to run in the background and passively collect CPU usage
statistics for our builders. The intention here is that we can use this
information over time to diagnose issues with builders, see where we can
optimize our build, fix parallelism issues, etc. This might not end up
being too useful in the long run but it's data we've wanted to collect
for quite some time now, so here's a stab at it!

Comments about how this is intended to work can be found in the python
script used here to collect CPU usage statistics.

Closes #48828

@rust-highfive
Collaborator

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Jun 7, 2019
@alexcrichton
Member Author

r? @pietroalbini

Member

@Mark-Simulacrum Mark-Simulacrum left a comment

This looks great! I haven't looked through all the platform-specific details, since it seems great if it works and I don't know much about them anyway.

# https://rust-lang-ci2.s3.amazonaws.com/rustc-builds/68baada19cd5340f05f0db15a3e16d6671609bcc/cpu-x86_64-apple.csv
#
# Each CSV file has two columns. The first is the timestamp of the measurement
# and the second column is the % of idle cpu time in that time slice. Ideally
Member

I presume this is the total idle % time; could we perhaps get per-thread/vCPU measurements? Fine to leave until later too, of course.
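
As a rough illustration of the format described in the quoted script comment (a timestamp column followed by a percent-idle column), the sketch below parses such a file and summarizes it. The function name `summarize_idle`, the timestamp format in the sample rows, and the 50% threshold are assumptions made for illustration; none of them come from the PR.

```python
import csv
import io


def summarize_idle(csv_text, busy_threshold=50.0):
    """Summarize a two-column CSV of (timestamp, percent idle).

    Returns (sample_count, mean_idle, mostly_idle_count), where
    mostly_idle_count is the number of samples whose idle percentage
    is strictly above busy_threshold.
    """
    idle_values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        idle_values.append(float(row[1]))
    if not idle_values:
        return 0, 0.0, 0
    mean_idle = sum(idle_values) / len(idle_values)
    mostly_idle = sum(1 for v in idle_values if v > busy_threshold)
    return len(idle_values), mean_idle, mostly_idle


# Hypothetical sample rows; the timestamp format here is assumed.
sample = (
    "2019-06-07T20:00:00,12.5\n"
    "2019-06-07T20:00:01,87.5\n"
    "2019-06-07T20:00:02,50.0\n"
)
print(summarize_idle(sample))
```

A long run of mostly idle samples in a real file would point at a serial or waiting phase of the build, which is the kind of signal the PR description hopes to find.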

Member Author

The OSX/Linux implementations have this information, but for Windows we just have total times currently. It's possible to figure this out, but afaik it's not all that useful: I can't think of a meaningful statistic about individual CPUs, as opposed to all of them in aggregate, that we could act on.

Member

My thought was a distinction between "one thread is doing work" and "all threads are busy, but none are all that active (e.g., network/IO heavy, presumably)". But yeah, I agree that, at least initially, it's not all that useful.
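
To make the aggregate versus per-CPU distinction in this thread concrete, here is a minimal sketch that assumes Linux-style `/proc/stat` counters (an aggregate `cpu` line plus one `cpuN` line per core, with idle time as the fourth counter). It is not taken from the PR's script; the helper names and the synthetic snapshots are hypothetical.

```python
def parse_proc_stat(text):
    """Map each CPU label ('cpu', 'cpu0', ...) to its list of tick counters."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(x) for x in fields[1:]]
    return counters


def idle_percent(before, after):
    """Percent of ticks spent idle between two snapshots, per CPU label."""
    result = {}
    for label, new in after.items():
        old = before.get(label)
        if old is None:
            continue
        # Only the first eight counters are summed: on Linux, guest time
        # is already included in user time, so adding it would double count.
        total = sum(new[:8]) - sum(old[:8])
        idle = new[3] - old[3]  # fourth counter is idle time
        result[label] = 100.0 * idle / total if total > 0 else 0.0
    return result


# Synthetic snapshots (not real builder data): cpu0 is fully busy,
# cpu1 is fully idle, so the aggregate line reports 50% idle.
t0 = "cpu 0 0 0 0\ncpu0 0 0 0 0\ncpu1 0 0 0 0\n"
t1 = "cpu 100 0 0 100\ncpu0 100 0 0 0\ncpu1 0 0 0 100\n"
print(idle_percent(parse_proc_stat(t0), parse_proc_stat(t1)))
```

In this made-up case the aggregate figure alone cannot tell "one core saturated, one idle" apart from "both cores half busy", which is the distinction raised above.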

@rust-highfive
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

travis_time:end:2353748c:start=1559938028754349362,finish=1559938120034569453,duration=91280220091
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---

[00:25:55] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:25:55] tidy error: /checkout/src/etc/cpu-usage-over-time.py:83: line longer than 100 chars
[00:26:00] some tidy checks failed
[00:26:00] 
[00:26:00] 
[00:26:00] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:26:00] 
[00:26:00] 
[00:26:00] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:26:00] Build completed unsuccessfully in 0:01:15
---
travis_time:end:020247fa:start=1559939726345153331,finish=1559939726350585723,duration=5432392
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:21e5721a
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:3a81a7c7
travis_time:start:3a81a7c7
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:0073c904
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@pietroalbini
Member

Took a glance at the script and it seems fine (why is it in src/etc and not in src/ci?). One thing I'd do is to add a condition to run the upload (and the capture too?) only when the AWS secret key is present: for the current builders it's always there but it won't be present for the PR builders.

@alexcrichton force-pushed the azure-pipelines-cpu branch from 410b2dc to 77a1434 on June 10, 2019 14:23
@alexcrichton
Member Author

Good points @pietroalbini, I think those are handled now

@Mark-Simulacrum
Member

@bors r+

@bors
Contributor

bors commented Jun 10, 2019

📌 Commit 77a1434cb766830277e273afa8b7a9b4089d0e41 has been approved by Mark-Simulacrum

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Jun 10, 2019
- bash: aws s3 cp --acl public-read cpu-usage.csv s3://$DEPLOY_BUCKET/rustc-builds/$BUILD_SOURCEVERSION/cpu-$SYSTEM_JOBNAME.csv
  env:
    AWS_SECRET_ACCESS_KEY: $(AWS_SECRET_ACCESS_KEY)
  condition: ne(variables['System.PullRequest.IsFork'], 'True')
Member

Suggested change
- condition: ne(variables['System.PullRequest.IsFork'], 'True')
+ condition: contains(variables, 'AWS_SECRET_ACCESS_KEY')

Can we make this more generic, so that it's automatically enabled or disabled based on the presence of the secret key?
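
One script-side way to express "only upload when the secret is present" is sketched below. This is not the pipeline condition used in the PR (a later commit notes that the condition suggested here failed to evaluate); it is a hypothetical guard that a wrapper script could run before calling the upload command. The function name `should_upload` is made up, and the handling of an unexpanded `$(...)` macro is an assumption about how an undefined pipeline variable might reach the environment.

```python
import os


def should_upload(environ=None, name="AWS_SECRET_ACCESS_KEY"):
    """Return True only if the secret looks like it was actually provided."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    # Assumption: an undefined pipeline variable may arrive either empty or
    # as the unexpanded macro text, e.g. "$(AWS_SECRET_ACCESS_KEY)".
    if not value or value == "$(%s)" % name:
        return False
    return True


if should_upload():
    print("secret present: the upload step would run here")
else:
    print("secret missing: skipping the upload")
```

Checking in the script keeps the decision tied to the secret itself rather than to whether the build comes from a fork, which is the generality the comment above asks for.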

@pietroalbini
Member

@bors r- due to the above comment.

@bors added the S-waiting-on-author label (Status: This is awaiting some action (such as code changes or more information) from the author.) and removed the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) on Jun 11, 2019
@alexcrichton force-pushed the azure-pipelines-cpu branch from 77a1434 to f2c37a5 on June 11, 2019 13:56
@alexcrichton
Member Author

@bors: r=pietroalbini

@bors
Contributor

bors commented Jun 11, 2019

📌 Commit f2c37a5 has been approved by pietroalbini

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-author label (Status: This is awaiting some action (such as code changes or more information) from the author.) on Jun 11, 2019
Centril added a commit to Centril/rust that referenced this pull request Jun 12, 2019
…=pietroalbini

ci: Collect CPU usage statistics on Azure

bors added a commit that referenced this pull request Jun 12, 2019
Rollup of 9 pull requests

Successful merges:

 - #60187 (Generator optimization: Overlap locals that never have storage live at the same time)
 - #61348 (Implement Clone::clone_from for Option and Result)
 - #61568 (Use Symbol, Span in libfmt_macros)
 - #61632 (ci: Collect CPU usage statistics on Azure)
 - #61654 (use pattern matching for slices destructuring)
 - #61671 (implement nth_back for Range(Inclusive))
 - #61688 (is_fp and is_floating_point do the same thing, remove the former)
 - #61705 (Pass cflags rather than cxxflags to LLVM as CMAKE_C_FLAGS)
 - #61734 (Migrate rust-by-example to MdBook2)

Failed merges:

r? @ghost
@bors merged commit f2c37a5 into rust-lang:master on Jun 12, 2019
pietroalbini added a commit to pietroalbini/rust that referenced this pull request Jun 12, 2019
The condition I suggested in rust-lang#61632 was not correct and it errors out
while evaluating. This fixes the condition.

Example of a failure:
https://dev.azure.com/rust-lang/rust/_build/results?buildId=543
bors added a commit that referenced this pull request Jun 12, 2019
ci: fix ci stats upload condition

The condition I suggested in #61632 was not correct and it errors out while evaluating. This fixes the condition. [Example of a failure](https://dev.azure.com/rust-lang/rust/_build/results?buildId=543).

r? @alexcrichton
Centril added a commit to Centril/rust that referenced this pull request Jun 12, 2019
…=alexcrichton

ci: fix ci stats upload condition

The condition I suggested in rust-lang#61632 was not correct and it errors out while evaluating. This fixes the condition. [Example of a failure](https://dev.azure.com/rust-lang/rust/_build/results?buildId=543).

r? @alexcrichton
@alexcrichton deleted the azure-pipelines-cpu branch on July 8, 2019 20:33
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Development

Successfully merging this pull request may close these issues.

Collect CPU utilization statistics of CI builders
6 participants