
Infrastructure tracker #40721

Closed
aturon opened this issue Mar 21, 2017 · 25 comments
Labels
C-tracking-issue: A tracking issue for an RFC or an unstable feature.
metabug: Issues about issues themselves ("bugs about bugs").
T-infra: Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@aturon (Member) commented Mar 21, 2017

Status: Monitoring

Known issues

  • CI: currently experiencing a variety of problems that are causing the PR queue to back up:
    • Travis outages and other problems on macOS.
      • We are reporting these to Travis as they occur, and are told they are working on it, but the problems are greatly delaying our PR testing.
    • sccache bugs
      • @alexcrichton is currently focused full-time on debugging sccache, the S3-based ccache clone written in Rust that we use to cache LLVM builds.
    • Spurious build failures
      • @alexcrichton has worked hard to track these down, but would welcome help from any intrepid Rustaceans who want to take on the challenge.
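For readers unfamiliar with sccache: the core idea is to key compiler output on a hash of the inputs and memoize it in shared storage (S3, in our case). The following is only an illustrative sketch of that caching idea, with an in-memory dict standing in for S3 and a stub standing in for the compiler; none of this is sccache's actual code.

```python
import hashlib

def compile_source(source: str) -> str:
    """Hypothetical stand-in for an expensive compiler invocation."""
    return f"object({source})"

class CompileCache:
    """Memoizes compiler output keyed by a hash of the inputs.
    sccache does this with S3 as the backing store; a dict stands in here."""
    def __init__(self):
        self.store = {}   # stand-in for the S3 bucket
        self.hits = 0
        self.misses = 0

    def get_or_compile(self, source: str, flags: str) -> str:
        # The cache key must cover everything that affects the output.
        key = hashlib.sha256((source + "\0" + flags).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        obj = compile_source(source)
        self.store[key] = obj
        return obj

cache = CompileCache()
a = cache.get_or_compile("fn main() {}", "-O2")
b = cache.get_or_compile("fn main() {}", "-O2")  # same inputs: served from cache
c = cache.get_or_compile("fn main() {}", "-O0")  # different flags: recompiled
print(cache.hits, cache.misses)
```

The payoff for CI is that a rebuilt container or fresh builder can pull the expensive LLVM artifacts from shared storage instead of recompiling them.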

Help wanted

All infrastructure issues

Easy

  • Check the PR queue for old PRs that have yet to be reviewed, and ping the reviewer on IRC or elsewhere. (Yes, you can and should do this!)

  • Check the PR queue for build failures, find the failed build, and extract the relevant information into a comment on the PR.

Medium

Hard

  • Spurious build failures. These are currently extremely difficult to track down, since they cannot be reproduced locally and we cannot log into the build machines. They are thus also very high-value bugs to close. Contact @alexcrichton if you'd like to give it a shot.

Infrastructure projects

  • CI + releases. Currently set up via Travis + AppVeyor, with some additional infrastructure in Rust Central Station to monitor and control the builds.

  • Rust Central Station. Oversees CI/releases and nagbox. Set up using Docker.

  • homu. The bot behind @bors. Hooks into the above CI infrastructure to actually land PRs.

  • rfcbot. A bot for managing the FCP process of RFCs and tracking issues.

  • rusty-dash. A dashboard tracking a number of metrics for Rust and its community.

  • highfive. A bot that welcomes new contributors and randomly assigns reviewing duties.

    • Maintained by @nrc.
  • nagbot. A bot for sending email reminders to the Rust subteams about reviewing duties.

  • rustbuild. The x.py build system for the Rust compiler.

  • play. The infrastructure behind https://play.rust-lang.org/

  • perf. Performance monitoring for the Rust compiler.

@aturon aturon added A-infrastructure metabug Issues about issues themselves ("bugs about bugs") labels Mar 21, 2017
@aturon (Member, Author) commented Mar 21, 2017

To folks who tend to monitor the PR queue or otherwise help out with infrastructure: for the time being, I'd like to try using this issue to centralize some tracking of what's going on with infrastructure, and ways people can get involved. Right now too much of this work is falling on too few shoulders (cough @alexcrichton cough) and we need to work on spreading it out.

If you see something amiss with any piece of infrastructure, please take a look at the status page on the top here to see if the issue is known. If it's not, open a new A-infrastructure issue and leave a comment with a link. When in doubt, leave a comment here. Similarly, if you want to help out but don't know how, leave a comment here.

cc @rust-lang/compiler @rust-lang/libs @frewsxcv @TimNN @Mark-Simulacrum @erickt @edunham @japaric @est31 @durka

@bstrie (Contributor) commented Mar 22, 2017

perf: currently down. (TODO: provide details)

@aturon Is perf.rlo updating? At the bottom it says "Updated as of: 12/30/2016, 1:24:27 PM"

@nrc (Member) commented Mar 22, 2017

Is perf.rlo updating?

No. @Mark-Simulacrum has been working on some improvements that should get it going again.

@alexcrichton (Member)

One longstanding source of spurious failures is general network errors. I've opened an issue that I believe will help mitigate at least one instance of these, and help implementing it would be greatly appreciated!
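As general context (and not the specific fix proposed in the linked issue), the usual mitigation for transient network errors in CI is retrying with exponential backoff. A minimal sketch, where the flaky download is a simulated stand-in for a real network call:

```python
import random
import time

def retry_with_backoff(op, attempts=4, base_delay=0.01):
    """Call op(); on a transient failure, sleep base_delay * 2**i plus jitter, then retry."""
    for i in range(attempts):
        try:
            return op()
        except OSError:
            if i == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** i) + random.uniform(0, base_delay))

# Hypothetical flaky network call: fails twice, then succeeds.
calls = {"n": 0}
def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset by peer")
    return b"payload"

result = retry_with_backoff(flaky_download)
print(result, calls["n"])
```

The jitter matters in CI: without it, many builders retrying in lockstep can re-overload whatever service just recovered.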

@alexcrichton (Member)

It looks like OSX cycle time for i686-apple-darwin has regressed 20% recently, and unfortunately I'm not sure how to explain it :(

@frewsxcv (Member)

Looks like all appveyor builds are currently failing: #40694 (comment)

@aidanhs (Member) commented Mar 30, 2017

My fault :( Back to the drawing board...
A partial rollback to unblock AppVeyor has already been r+ed, and the build has already gotten past the part that was blocked.

@frewsxcv (Member)

Seemingly unrelated to the previous couple of messages, the past few attempted PRs failed with the same error message on MSYS_BITS=32 on AppVeyor:

= note: "gcc" "-Wl,--enable-long-section-names" "-fno-use-linker-plugin" "-Wl,--nxcompat" "-nostdlib" "-Wl,--large-address-aware" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib\\crt2.o" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib\\rsbegin.o" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1-std\\i686-pc-windows-gnu\\release\\deps\\collectionstest-cf7d8872f6b0686a.0.o" "-o" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1-std\\i686-pc-windows-gnu\\release\\deps\\collectionstest-cf7d8872f6b0686a.exe" "-Wl,--gc-sections" "-nodefaultlibs" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1-std\\i686-pc-windows-gnu\\release\\deps" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1-std\\release\\deps" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "-Wl,-Bstatic" "-Wl,-Bdynamic" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "-l" "test-51dd1e12bb8fc6c0" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "-l" "term-9bb4a9959ced7ebc" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "-l" "getopts-83c65310844a796d" "-L" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib" "-l" "std-4d6881ec6132b951" "C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib\\libcompiler_builtins-7ac5a34e9b48514f.rlib" "-l" "kernel32" "-l" "advapi32" "-l" "ws2_32" "-l" "userenv" "-l" "shell32" "-l" "gcc_eh" "-lmingwex" "-lmingw32" "-lgcc" "-lmsvcrt" "-luser32" "-lkernel32" 
"C:\\projects\\rust\\build\\i686-pc-windows-gnu\\stage1\\lib\\rustlib\\i686-pc-windows-gnu\\lib\\rsend.o"
  = note: collect2.exe: error: ld returned 5 exit status

@aidanhs (Member) commented Mar 31, 2017

Similar to something lots of people using Arduino saw? The resolving PR is fairly opaque about what the actual fix was, though.

@alexcrichton (Member)

@frewsxcv let's try tracking those here: #40906

It's likely related to the 4.9.3 -> 6.2.0 mingw upgrade.

@Mark-Simulacrum (Member)

Going to provide a summary of the current perf.rlo situation as I know it (cc @nikomatsakis, who I've talked to about this).

The current collection infrastructure is broken, for reasons that are not well understood, and I've deemed it hard enough to fix and sufficiently difficult to maintain that it needed a rewrite. That work has been started here: https://github.com/Mark-Simulacrum/rustc-perf-collector. The project works for the collection side of things (though it does not yet upload results to GitHub), but it has not been integrated into the HTTP server for perf.rlo. I've been meaning to devote some time to this, as I don't expect it to be all that hard, but I haven't quite gotten around to it yet, and we (Niko and I) have identified a few potential roadblocks to getting it started.

A number of the roadblocks are discussed and summarized in this internals post.

To summarize the current situation:

  • Collection based on downloading artifacts that are built as PRs land is implemented, and ~works.
  • It's unknown exactly what we want to collect (see the internals post).
  • The frontend (perf.rlo) does not work with the new collection.

What needs to be done to get perf.rlo working once more:

  • Pruning and potential additions to the benchmark suite
  • New collection infrastructure running on the benchmark server
  • perf.rlo's backend portion updated to work with the new collection output

Let me know of any questions; I'd be happy to answer them.
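For anyone wanting to pitch in, the collection side can be pictured roughly like this. This is only an illustrative sketch of the shape of a collector (time each benchmark against a given artifact, emit a JSON record keyed by commit); the function names and output schema here are made up and are not rustc-perf-collector's actual design.

```python
import json
import time

def run_benchmark(name, workload):
    """Time one workload a few times and keep the minimum wall-clock sample,
    which is the least noise-prone summary for a CI benchmark machine."""
    samples = []
    for _ in range(3):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return {"benchmark": name, "seconds": min(samples)}

def collect(commit_sha, benchmarks):
    """Produce one JSON record per artifact, keyed by the commit it was built from,
    so the frontend can plot results against the commit history."""
    return json.dumps({
        "commit": commit_sha,
        "results": [run_benchmark(n, w) for n, w in benchmarks.items()],
    })

# Hypothetical workloads standing in for compiling real benchmark crates:
report = collect("abc123", {"noop": lambda: None, "sum": lambda: sum(range(1000))})
data = json.loads(report)
print(data["commit"], len(data["results"]))
```

The real collector would invoke a downloaded rustc artifact on each benchmark crate rather than timing Python closures, but the record-per-commit structure is the part the perf.rlo backend needs to consume.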

@aturon (Member, Author) commented Apr 3, 2017

An outage related to the CentOS 5 EOL is being discussed in the #rust-infra channel:

3:39 PM so typically we cache docker images on travis
3:39 PM those caches seem to have been cleared
3:39 PM so we're trying to rebuild all our docker images on each pr now
3:39 PM one image is the centos 5 image that we build releases inside of
3:39 PM x86 and x86_64 images
3:39 PM so apparently centos EOL'd a couple days ago
3:40 PM and they appear to have flat out deleted info from their servers
3:40 PM so that docker container will no longer build
3:40 PM this means that everything is frozen until we fix that
3:40 PM possible strategies are:
3:40 PM a) figure out how to get the image building again
3:40 PM b) figure out how to build an older glibc somewhere else
3:40 PM c) bite the bullet and increase our glibc requirement
3:41 PM obviously (c) is the easiest
3:41 PM yet it's the highest impact b/c I have no idea what would break as a result
3:41 PM I have no idea how to do (a) and (b)
3:41 PM I'm currently investigating
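To illustrate strategy (a): the standard workaround for an EOL'd CentOS release is to repoint yum from the deleted live mirrors to the frozen vault.centos.org archive. The sketch below shows that rewrite; the repo-file contents and the exact vault path are illustrative assumptions, not our actual Docker image configuration.

```python
import re

# Hypothetical excerpt of a CentOS yum repo file as shipped in the image:
repo_before = """\
[base]
name=CentOS-$releasever - Base
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
"""

def point_at_vault(repo_text: str, version: str) -> str:
    # Disable mirrorlist lines; live mirrors no longer carry EOL'd releases.
    text = re.sub(r"(?m)^mirrorlist=", "#mirrorlist=", repo_text)
    # Re-enable baseurl lines, rewritten to the frozen vault archive for
    # a pinned release (e.g. "5.11"), which no longer changes over time.
    text = re.sub(
        r"(?m)^#baseurl=http://mirror\.centos\.org/centos/\$releasever",
        f"baseurl=http://vault.centos.org/{version}",
        text,
    )
    return text

fixed = point_at_vault(repo_before, "5.11")
print(fixed)
```

In a Dockerfile this kind of rewrite would run before the first yum invocation, so the image builds reproducibly even after the live mirrors drop the release.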

@aturon (Member, Author) commented Apr 3, 2017

Note: the previous comment is about a general outage for the queue.

@frewsxcv (Member) commented Apr 4, 2017

Regarding the previous comments: that issue has been resolved via #41045, though it (and other PRs) appear to be struggling to land because we keep hitting the three-hour mark on Travis.

@mrhota (Contributor) commented Apr 4, 2017

Every time we retry, everything starts over from the beginning, as if none of the preliminary build prep, LLVM build, and other not-actually-rustc-building work had happened.

there has to be a better way

@aidanhs (Member) commented Apr 5, 2017

https://travis-ci.org/rust-lang/rust/builds/219024973: if you look at the two OSX builders that took >2hr30min (!) you'll see that the logs have been truncated. While it was still building, opening the page showed the truncated log, then new logs (e.g. test output) were streamed to me. Refreshing truncated it back down again.

Didn't cause a build failure, but maybe worth being aware of: I can imagine this being very annoying if the build had failed with truncated logs.

@alexcrichton (Member)

@aidanhs yeah, I've found the output to sometimes be confusing on Travis. The raw logs at least appear not to be truncated?

Note that we've got a separate issue for how slow OSX is.

@aidanhs (Member) commented Apr 5, 2017

Odd, I definitely checked that and they were truncated too (or I wouldn't have mentioned it). I guess it was a blip that corrected itself, which is a relief.

@alexcrichton (Member)

Heh, I think I've definitely noticed that before as well; it sometimes just corrects itself...

@larsbergstrom (Contributor)

@nrc Is there any desire to merge your changes to highfive back into the upstream servo/highfive repo? We've done a lot of work since your fork, and AFAIK there isn't anything in yours that is overly Rustaceous and non-upstreamable.

@nrc (Member) commented Apr 6, 2017

@larsbergstrom I haven't looked at the Servo highfive in ages, but the last time I checked, the two had diverged considerably and merging would be very non-trivial. I have nothing against doing so, but it seems pretty low priority and quite a lot of work, so I can't see it actually happening.

@frewsxcv (Member) commented Apr 11, 2017

It looks like the CentOS 'vault' no longer has packages for CentOS 5, so our builds will fail until we find a resolution.

@frewsxcv (Member)

To follow up on my previous comment: it turns out we were using the wrong 'vault' URL path. The fix is in #41231.

@Mark-Simulacrum Mark-Simulacrum added T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. and removed A-infrastructure labels Jun 25, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 27, 2017
@steveklabnik (Member)

Triage ping. Not sure if this issue is still valid or worth keeping open.

@Mark-Simulacrum (Member)

Nominating for infra team discussion; I personally support closing this issue. I don't think a tracking issue like this adds much to our work.
