-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/build/env/darwin-arm64: run buildlets in a per-build VM #48945
Comments
The state of support for virtualization of darwin/arm64 is poor. There is no solution in the near future for hosting a Mac VM on ARM64 in our current environment (VMWare on MacStadium) or in our qemu-hvm environments. This is due to a number of issues, not limited to the fact that darwin/arm64 uses an iOS like boot-process instead of a typical PC boot process. However, the AWS buildlets may work well for us. The workflow of creating a macOS instance on AWS is roughly:
This comes with some caveats. After an instance is stopped, the Dedicated Host is apparently re-imaged by AWS (in a "Pending" state for about an hour). This means, an instance per-buildlet is prohibitively expensive at our volume, as a downtime of an hour between builds would necessitate a very large pool of Dedicated Hosts for our hundreds of darwin builds per day that we do. We could avoid some of the re-image downtime issue by re-using instances between builds. This has a bit of an issue with keeping environments pristine, but we may be able to automate some of that away with APFS filesystem snapshots, and retiring the instances on some interval. Finally, for releaselets, we could ensure that we always use a fresh VM that hasn't been tainted by previous builds running on it. This is less featureful than our always-fresh image approach that we have on MacStadium / VMWare, but it would allow for a somewhat-clean environment for our darwin/ARM64 builders without too much overhead. Finally, this approach could be re-used for all of our mac builders. This would have a massive benefit of reducing the effort needed to build and maintain new macOS images for each release, which has now doubled with two processor architectures. Our existing approach for AWS AMIs using Packer should work nicely in this scenario. |
Documentation on automatic Dedicated Host provisioning and Releasing: https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html |
I'm starting prototyping with this testing out AWS Mac instances, first by creating some standard reverse builders. |
https://go.dev/cl/430696 added an AWS darwin-amd64 reverse builder ( It actually almost "just works". All of the x/ repos seem to be passing. The main Go repo fails on a specific test, This failure is oddly specific: the system /usr/bin/nohup binary is unhappy. We've had this issue before in #5135 inside of tmux. This is also a common problem online, though frustratingly I have yet to find a concrete description of the problem, just workarounds. One mention is that sshd must have PAM enabled. The AWS sshd config does indeed disable PAM, so that may be related. (Though my launchd service isn't run through ssh, so it isn't clear how that is related). I'm going to investigate running QEMU on these instances, so I'll pause investigation into |
Change https://go.dev/cl/432115 mentions this issue: |
Upcoming QEMU-based builders can shutdown in the same manner as macstadium VMs. For golang/go#48945. Change-Id: I56cf8389c542b6a52773d333ed48a585b0b9c64b Reviewed-on: https://go-review.googlesource.com/c/build/+/432115 Reviewed-by: Jenny Rakoczy <jenny@golang.org> Reviewed-by: Heschi Kreinick <heschi@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/432396 mentions this issue: |
Change https://go.dev/cl/432395 mentions this issue: |
We now have three hosts running six reverse builder VMs fully set up and (almost) ready. There is one failing test (https://build.golang.org/log/c40b5c45d0dc28318fd9ad0149efddfe39ff27d7) because an extra deprecation warning printed by bash. Once the builds are working next steps (short and long term) will be:
|
This adds instructions and scripts for running darwin builders inside QEMU on AWS Mac instances. Only darwin-amd64-12 is tested thus far. For golang/go#48945. Change-Id: I453c269c80f5b9af0f8059f3ea59ee517654baa8 Reviewed-on: https://go-review.googlesource.com/c/build/+/432395 Reviewed-by: Jenny Rakoczy <jenny@golang.org> Reviewed-by: Heschi Kreinick <heschi@google.com>
Change https://go.dev/cl/432857 mentions this issue: |
Change https://go.dev/cl/432856 mentions this issue: |
We do not want to quote $EXTRA_ARGS, as we explicitly want them split into multiple arguments to qemu. This fixes start-installer.sh. For golang/go#48945. Change-Id: I6659f74120be534e9d909bcff85151011f8addd2 Reviewed-on: https://go-review.googlesource.com/c/build/+/432856 Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
All of the existing darwin-amd64 builders get AWS equivalents. For golang/go#48945. Change-Id: I076c19a47b6cc1fa1fb1be25219e6616497e1dc9 Reviewed-on: https://go-review.googlesource.com/c/build/+/432857 Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Change https://go.dev/cl/432860 mentions this issue: |
Change https://go.dev/cl/432859 mentions this issue: |
race builders must have a name ending in -race. Switch the name to match that rule. For golang/go#48945. Change-Id: I69caf2e4d2e059a666d14dfe325a275e467488de Reviewed-on: https://go-review.googlesource.com/c/build/+/432859 Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>
Different networking flag is needed for older macOS. Plus the disk numbering seems to change arbitrarily, so don't promise my examples are accurate. For golang/go#48945. Change-Id: I0e576e0c6c9e80969462a353c2e0326246ff4280 Reviewed-on: https://go-review.googlesource.com/c/build/+/432860 Auto-Submit: Michael Pratt <mpratt@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/442255 mentions this issue: |
This networking mode, which uses a virtualization networking API from Apple, appears to be more stable than the default userspace networking. Multiple guests will share the same subnet, so they must have different MAC addresses. Apple's API seems to run an DNS server that serves compressed SRV records, which Go does not support (golang/go#36718) and thus causes a net test to fail. Until that is resolved, we must manually specify a custom DNS server. For golang/go#48945. Updates golang/go#36718. Change-Id: Ibc5b91ed1456d1364975febe83fd282c42bd6ed1 Reviewed-on: https://go-review.googlesource.com/c/build/+/442255 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Jenny Rakoczy <jenny@golang.org> Run-TryBot: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/448435 mentions this issue: |
For golang/go#48945. For golang/go#55355. Change-Id: Ic8f346995024c69a3a8f2b91681e4ff90280f2c6 Reviewed-on: https://go-review.googlesource.com/c/build/+/448435 Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/449877 mentions this issue: |
Change https://go.dev/cl/449876 mentions this issue: |
Change https://go.dev/cl/449875 mentions this issue: |
Change https://go.dev/cl/453956 mentions this issue: |
…h 13 The AWS builders have been up and running reliably for a while now. Switch over. Unfortunately the macOS 13 builder had -aws in its name, so we're creating a new builder and will need a backfill. So it goes. While I'm here, build 1.20 with 13 now that we have it. For golang/go#48945, golang/go#40561. Change-Id: Idaac9cfea6fe3f5e190432ecb512a453812618c7 Reviewed-on: https://go-review.googlesource.com/c/build/+/453956 Run-TryBot: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Auto-Submit: Heschi Kreinick <heschi@google.com>
Currently, DHCP assigns a dynamic IP to each guest. This means that the parent program doesn't know the IP of the guest in order to do health checks. Add a static IP assignment based on the guest MAC address. For golang/go#48945. Change-Id: Icafa4aa5a0b87ab815c882bc8215d5932d1377c0 Reviewed-on: https://go-review.googlesource.com/c/build/+/449875 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Heschi Kreinick <heschi@google.com>
Subsequent CLs will add support for running darwin guests. To support differences between the guests, move windows support to a new file and selected by -guest-os=windows. This CL has no functional change. Windows remains the default for compatibility with existing scripts. For golang/go#48945. Change-Id: Iec2827b216f7c6b72c53a61b125d132beb44a779 Reviewed-on: https://go-review.googlesource.com/c/build/+/449876 Reviewed-by: Heschi Kreinick <heschi@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
This is used to replace env/darwin/aws/start-snapshot.sh on CI instances. Run like so: Buildlet 1: $ ./runqemubuildlet -guest-os=darwin -macos-version=12 -osk=<OSK VAL> -guest-index=1 -buildlet-healthz-url="http://192.168.64.101:8080/healthz" Buildlet 2: $ ./runqemubuildlet -guest-os=darwin -macos-version=12 -osk=<OSK VAL> -guest-index=2 -buildlet-healthz-url="http://192.168.64.102:8080/healthz" For golang/go#48945. Change-Id: Ic0010bcd0062c7b77d09bf2addb1297e7d116474 Reviewed-on: https://go-review.googlesource.com/c/build/+/449877 Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Jenny Rakoczy <jenny@golang.org> Run-TryBot: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/456055 mentions this issue: |
Change https://go.dev/cl/456042 mentions this issue: |
QEMU needs to run as root on darwin for working networking. Currently we run runqemubuildlet as a normal user, and invoke QEMU with `sudo`. Unfortunately, this means that runqemubuildlet doesn't have permission to signal QEMU when the health check fails. Instead, run runqemubuildlet as root so it has permission to signal QEMU. For golang/go#48945. Change-Id: I07bd61168462c7272db05c328eb11dcd39743f79 Reviewed-on: https://go-review.googlesource.com/c/build/+/456042 Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Michael Pratt <mpratt@google.com>
The current macstadium count is 5, but it is tricky to support an odd number with our current setup on AWS (2 VMs per host), so use 6 instead. For golang/go#48945. Change-Id: Id00ccfcde3033722715a0299cf4f40acf1665c36 Reviewed-on: https://go-review.googlesource.com/c/build/+/432396 Reviewed-by: Heschi Kreinick <heschi@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Jenny Rakoczy <jenny@golang.org> Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Currently our AWS darwin-amd64 builder guests take ~4 minutes to boot. This is much slower than guests on MacStadium (I'm told those were closer to 10s). 4 minute boot time is a significant drag on capacity. Subrepo tests are often much shorter than that, meaning we spend more time booting than we do running tests. I dug into this a bit yesterday:
A profile shows ~15% of all cycles spent in I sent a mail to qemu-discuss@nongnu.org about this, but I'm not sure it went through, as it doesn't appear on the mailing list archive. Anyways, a quick improvement will be to switch to 4 CPUs, which hopefully is still enough to avoid test timeouts. |
I added tracing to QEMU, and these are the VM exit reason counts during boot:
The only one here that surprises me is HLT exits. It looks like XNU may use this to enter an idle state when a more complex power management subsystem is not (yet?) available: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/pmCPU.c#L176 |
Change https://go.dev/cl/461775 mentions this issue: |
QEMU scales poorly with CPU count on macOS (see https://go.dev/issue/48945#issuecomment-1378850381). Reducing the CPU count from 6 to 4 reduces boot time from ~4min to ~1min40sec. For golang/go#48945. Change-Id: I63fa70938049534b8f87d30935179d99a2e8f2f6 Reviewed-on: https://go-review.googlesource.com/c/build/+/461775 Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org> Run-TryBot: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
For reference, buildlet wait times ( Before AWS switch (2022-10-15 through 2022-11-29):
After AWS switch (2022-11-30 through 2023-01-12):
Time running tests ( Before:
After:
Edit: these are for the Go repo only, not subrepos. |
Should we close this issue and mark it as completed? Any additional problems we find can be addressed in more specific issues if necessary. |
Closing this since it doesn't seem like there's anything in particular left. |
Change https://go.dev/cl/484746 mentions this issue: |
This reverts CL 461775. That CL failed to update cmd/runqemubuildlet, which is used for production VM execution. Thus it had no impact. The next CL will change memory, so just revert this for now rather than fixing to avoid changing two things at once. For golang/go#48945. For golang/go#59634. Change-Id: I9f1e26f2b3ffd6caca9683242a44597b7f1d8d58 Reviewed-on: https://go-review.googlesource.com/c/build/+/484746 Run-TryBot: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Commit-Queue: Michael Pratt <mpratt@google.com>
Currently, the darwin-arm64 buildlets do not run in a clean VM for each test run. Now that QEMU has progressed a bit, we should try running them in a VM again.
The text was updated successfully, but these errors were encountered: