Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/env/darwin-arm64: run buildlets in a per-build VM #48945

Closed
toothrot opened this issue Oct 13, 2021 · 27 comments
Closed

x/build/env/darwin-arm64: run buildlets in a per-build VM #48945

toothrot opened this issue Oct 13, 2021 · 27 comments
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@toothrot
Copy link
Contributor

Currently, the darwin-arm64 buildlets do not run in a clean VM for each test run. Now that QEMU has progressed a bit, we should try running them in a VM again.

@toothrot toothrot added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 13, 2021
@toothrot toothrot self-assigned this Oct 13, 2021
@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Oct 13, 2021
@gopherbot gopherbot added this to the Unreleased milestone Oct 13, 2021
@toothrot
Copy link
Contributor Author

toothrot commented Dec 3, 2021

The state of support for virtualization of darwin/arm64 is poor. There is no solution in the near future for hosting a Mac VM on ARM64 in our current environment (VMWare on MacStadium) or in our qemu-hvm environments. This is due to a number of issues, not limited to the fact that darwin/arm64 uses an iOS like boot-process instead of a typical PC boot process.

However, the AWS buildlets may work well for us. The workflow of creating a macOS instance on AWS is roughly:

  • purchase 24-hr leases of "Dedicated Hosts", marked as auto-assignable
  • Create mac1.metal VMs that get auto-assigned to the host pool

This comes with some caveats. After an instance is stopped, the Dedicated Host is apparently re-imaged by AWS (in a "Pending" state for about an hour). This means, an instance per-buildlet is prohibitively expensive at our volume, as a downtime of an hour between builds would necessitate a very large pool of Dedicated Hosts for our hundreds of darwin builds per day that we do.

We could avoid some of the re-image downtime issue by re-using instances between builds. This has a bit of an issue with keeping environments pristine, but we may be able to automate some of that away with APFS filesystem snapshots, and retiring the instances on some interval.

Finally, for releaselets, we could ensure that we always use a fresh VM that hasn't been tainted by previous builds running on it.

This is less featureful than our always-fresh image approach that we have on MacStadium / VMWare, but it would allow for a somewhat-clean environment for our darwin/ARM64 builders without too much overhead.

Finally, this approach could be re-used for all of our mac builders. This would have a massive benefit of reducing the effort needed to build and maintain new macOS images for each release, which has now doubled with two processor architectures. Our existing approach for AWS AMIs using Packer should work nicely in this scenario.

@toothrot
Copy link
Contributor Author

toothrot commented Dec 3, 2021

Documentation on automatic Dedicated Host provisioning and Releasing: https://docs.aws.amazon.com/license-manager/latest/userguide/host-resource-groups.html

@prattmic prattmic self-assigned this Sep 13, 2022
@prattmic
Copy link
Member

I'm starting prototyping with this testing out AWS Mac instances, first by creating some standard reverse builders.

@prattmic
Copy link
Member

prattmic commented Sep 14, 2022

https://go.dev/cl/430696 added an AWS darwin-amd64 reverse builder (darwin-amd64-12-aws), which I've set up manually with a launchd service.

It actually almost "just works". All of the x/ repos seem to be passing. The main Go repo fails on a specific test, os/signal.TestDetectNohup and TestNohup: https://build.golang.org/log/bee053b56d2f0d612725e4427f5640cdad5cad34

This failure is oddly specific: the system /usr/bin/nohup binary is unhappy. We've had this issue before in #5135 inside of tmux. This is also a common problem online, though frustratingly I have yet to find a concrete description of the problem, just workarounds.

One mention is that sshd must have PAM enabled. The AWS sshd config does indeed disable PAM, so that may be related. (Though my launchd service isn't run through ssh, so it isn't clear how that is related).

I'm going to investigate running QEMU on these instances, so I'll pause investigation into os/signal for now, since it may just not be an issue in QEMU guests.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432115 mentions this issue: cmd/buildet: allow halt of macOS QEMU VMs

gopherbot pushed a commit to golang/build that referenced this issue Sep 20, 2022
Upcoming QEMU-based builders can shutdown in the same manner as
macstadium VMs.

For golang/go#48945.

Change-Id: I56cf8389c542b6a52773d333ed48a585b0b9c64b
Reviewed-on: https://go-review.googlesource.com/c/build/+/432115
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432396 mentions this issue: dashboard: increase hast-darwin-amd64-12-aws count

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432395 mentions this issue: env/darwin: AWS darwin instances

@prattmic
Copy link
Member

prattmic commented Sep 21, 2022

We now have three hosts running six reverse builder VMs fully set up and (almost) ready.

There is one failing test (https://build.golang.org/log/c40b5c45d0dc28318fd9ad0149efddfe39ff27d7) because an extra deprecation warning printed by bash.

Once the builds are working next steps (short and long term) will be:

  • Create instances for older amd64 macOS releases.
  • Create instances for arm64 macOS releases (if desired).
  • Investigating putting instances behind NAT. Currently they have public IPs, but all inbound connections blocked.
  • Investigate smarter scheduling like makemac. Right now we just create guests in a loop with a fixed guest OS version.

gopherbot pushed a commit to golang/build that referenced this issue Sep 22, 2022
This adds instructions and scripts for running darwin builders inside
QEMU on AWS Mac instances. Only darwin-amd64-12 is tested thus far.

For golang/go#48945.

Change-Id: I453c269c80f5b9af0f8059f3ea59ee517654baa8
Reviewed-on: https://go-review.googlesource.com/c/build/+/432395
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432857 mentions this issue: dashboard: add all AWS darwin-amd64 builders

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432856 mentions this issue: env/darwin/aws: don't quote extra args

gopherbot pushed a commit to golang/build that referenced this issue Sep 22, 2022
We do not want to quote $EXTRA_ARGS, as we explicitly want them split
into multiple arguments to qemu. This fixes start-installer.sh.

For golang/go#48945.

Change-Id: I6659f74120be534e9d909bcff85151011f8addd2
Reviewed-on: https://go-review.googlesource.com/c/build/+/432856
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit to golang/build that referenced this issue Sep 22, 2022
All of the existing darwin-amd64 builders get AWS equivalents.

For golang/go#48945.

Change-Id: I076c19a47b6cc1fa1fb1be25219e6616497e1dc9
Reviewed-on: https://go-review.googlesource.com/c/build/+/432857
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432860 mentions this issue: env/darwin/aws: update docs

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/432859 mentions this issue: dashboard: make darwin-amd64-aws race builder actually run race

gopherbot pushed a commit to golang/build that referenced this issue Sep 26, 2022
race builders must have a name ending in -race. Switch the name to match
that rule.

For golang/go#48945.

Change-Id: I69caf2e4d2e059a666d14dfe325a275e467488de
Reviewed-on: https://go-review.googlesource.com/c/build/+/432859
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
gopherbot pushed a commit to golang/build that referenced this issue Sep 26, 2022
Different networking flag is needed for older macOS. Plus the disk
numbering seems to change arbitrarily, so don't promise my examples are
accurate.

For golang/go#48945.

Change-Id: I0e576e0c6c9e80969462a353c2e0326246ff4280
Reviewed-on: https://go-review.googlesource.com/c/build/+/432860
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@heschi heschi moved this to In Progress in Go Release Sep 27, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/442255 mentions this issue: env/darwin/aws: switch to vmnet-shared networking

gopherbot pushed a commit to golang/build that referenced this issue Oct 11, 2022
This networking mode, which uses a virtualization networking API from
Apple, appears to be more stable than the default userspace networking.

Multiple guests will share the same subnet, so they must have different
MAC addresses.

Apple's API seems to run an DNS server that serves compressed SRV
records, which Go does not support (golang/go#36718) and thus causes a
net test to fail. Until that is resolved, we must manually specify a
custom DNS server.

For golang/go#48945.
Updates golang/go#36718.

Change-Id: Ibc5b91ed1456d1364975febe83fd282c42bd6ed1
Reviewed-on: https://go-review.googlesource.com/c/build/+/442255
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/448435 mentions this issue: dashboard: add darwin 13 (Ventura) amd64 builders on AWS

gopherbot pushed a commit to golang/build that referenced this issue Nov 7, 2022
For golang/go#48945.
For golang/go#55355.

Change-Id: Ic8f346995024c69a3a8f2b91681e4ff90280f2c6
Reviewed-on: https://go-review.googlesource.com/c/build/+/448435
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/449877 mentions this issue: cmd/runqemubuildlet: add darwin support

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/449876 mentions this issue: cmd/runqemubuildlet: select windows support with a flag

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/449875 mentions this issue: env/darwin/aws: assign static IPs to each guest

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/453956 mentions this issue: dashboard,internal/releasetargets: run AMD64 Macs AWS, build 1.20 with 13

gopherbot pushed a commit to golang/build that referenced this issue Nov 29, 2022
…h 13

The AWS builders have been up and running reliably for a while now.
Switch over.

Unfortunately the macOS 13 builder had -aws in its name, so we're
creating a new builder and will need a backfill. So it goes.

While I'm here, build 1.20 with 13 now that we have it.

For golang/go#48945, golang/go#40561.

Change-Id: Idaac9cfea6fe3f5e190432ecb512a453812618c7
Reviewed-on: https://go-review.googlesource.com/c/build/+/453956
Run-TryBot: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Heschi Kreinick <heschi@google.com>
gopherbot pushed a commit to golang/build that referenced this issue Dec 5, 2022
Currently, DHCP assigns a dynamic IP to each guest. This means that the
parent program doesn't know the IP of the guest in order to do
health checks.

Add a static IP assignment based on the guest MAC address.

For golang/go#48945.

Change-Id: Icafa4aa5a0b87ab815c882bc8215d5932d1377c0
Reviewed-on: https://go-review.googlesource.com/c/build/+/449875
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Heschi Kreinick <heschi@google.com>
gopherbot pushed a commit to golang/build that referenced this issue Dec 5, 2022
Subsequent CLs will add support for running darwin guests. To support
differences between the guests, move windows support to a new file and
selected by -guest-os=windows.

This CL has no functional change. Windows remains the default for
compatibility with existing scripts.

For golang/go#48945.

Change-Id: Iec2827b216f7c6b72c53a61b125d132beb44a779
Reviewed-on: https://go-review.googlesource.com/c/build/+/449876
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit to golang/build that referenced this issue Dec 5, 2022
This is used to replace env/darwin/aws/start-snapshot.sh on CI
instances.

Run like so:

Buildlet 1:
$ ./runqemubuildlet -guest-os=darwin -macos-version=12 -osk=<OSK VAL> -guest-index=1 -buildlet-healthz-url="http://192.168.64.101:8080/healthz"

Buildlet 2:
$ ./runqemubuildlet -guest-os=darwin -macos-version=12 -osk=<OSK VAL> -guest-index=2 -buildlet-healthz-url="http://192.168.64.102:8080/healthz"

For golang/go#48945.

Change-Id: Ic0010bcd0062c7b77d09bf2addb1297e7d116474
Reviewed-on: https://go-review.googlesource.com/c/build/+/449877
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/456055 mentions this issue: cmd/runqemubuildlet: use sudo kill to signal on darwin

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/456042 mentions this issue: cmd/runqemubuildlet: run as root on darwin

gopherbot pushed a commit to golang/build that referenced this issue Dec 9, 2022
QEMU needs to run as root on darwin for working networking. Currently we
run runqemubuildlet as a normal user, and invoke QEMU with `sudo`.
Unfortunately, this means that runqemubuildlet doesn't have permission
to signal QEMU when the health check fails.

Instead, run runqemubuildlet as root so it has permission to signal
QEMU.

For golang/go#48945.

Change-Id: I07bd61168462c7272db05c328eb11dcd39743f79
Reviewed-on: https://go-review.googlesource.com/c/build/+/456042
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit to golang/build that referenced this issue Dec 29, 2022
The current macstadium count is 5, but it is tricky to support an odd
number with our current setup on AWS (2 VMs per host), so use 6 instead.

For golang/go#48945.

Change-Id: Id00ccfcde3033722715a0299cf4f40acf1665c36
Reviewed-on: https://go-review.googlesource.com/c/build/+/432396
Reviewed-by: Heschi Kreinick <heschi@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Jenny Rakoczy <jenny@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
@prattmic
Copy link
Member

Currently our AWS darwin-amd64 builder guests take ~4 minutes to boot. This is much slower than guests on MacStadium (I'm told those were closer to 10s). 4 minute boot time is a significant drag on capacity. Subrepo tests are often much shorter than that, meaning we spend more time booting than we do running tests.

I dug into this a bit yesterday:

  • Disk performance does not seem to be a bottleneck. Moving the disk image from the root EBS volume we currently use to the internal Mac SSD had no impact.
  • QEMU's hvf (i.e., macOS) backend seems to have scalability problems w.r.t. number of guest CPUs. Boot times with different CPU counts (we currently use 6):
    • 1 CPU -> ~1m30s
    • 2 CPU -> ~1m15s
    • 4 CPU -> ~1m40s
    • 6 CPU -> ~4m
    • 8 CPU -> ~6m

image

A profile shows ~15% of all cycles spent in hvf_vcpu_exec ->
qemu_mutex_lock_iothread / qemu_mutex_unlock_iothread, which sounds like lock contention to me. Indeed, this lock is pretty much held unconditionally for the duration
of all VM exits:
https://gitlab.com/qemu-project/qemu/-/blob/master/target/i386/hvf/hvf.c#L453. OTOH, the KVM backend avoids taking this lock for many (but not all) VM exit reasons.

I sent a mail to qemu-discuss@nongnu.org about this, but I'm not sure it went through, as it doesn't appear on the mailing list archive.

Anyways, a quick improvement will be to switch to 4 CPUs, which hopefully is still enough to avoid test timeouts.

@prattmic
Copy link
Member

I added tracing to QEMU, and these are the VM exit reason counts during boot:

25338097  hvf_vcpu_exit: exit reason 48 (EPT violation)
1465860   hvf_vcpu_exit: exit reason 7  (Interrupt window)
955636    hvf_vcpu_exit: exit reason 1  (External interrupt)
532542    hvf_vcpu_exit: exit reason 12 (HLT instruction)
80699     hvf_vcpu_exit: exit reason 30 (IO instruction)
3485      hvf_vcpu_exit: exit reason 10 (CPUID)
1597      hvf_vcpu_exit: exit reason 31 (RDMSR)
117       hvf_vcpu_exit: exit reason 28 (CR access)
69        hvf_vcpu_exit: exit reason 32 (WRMSR)
7         hvf_vcpu_exit: exit reason 55 (XSETBV)

The only one here that surprises me is HLT exits. It looks like XNU may use this to enter an idle state when a more complex power management subsystem is not (yet?) available: https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64b5b6aa6063a1f8f65aa32/osfmk/i386/pmCPU.c#L176

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/461775 mentions this issue: env/darwin/aws: reduce guest CPU count to 4

gopherbot pushed a commit to golang/build that referenced this issue Jan 12, 2023
QEMU scales poorly with CPU count on macOS (see
https://go.dev/issue/48945#issuecomment-1378850381). Reducing the CPU
count from 6 to 4 reduces boot time from ~4min to ~1min40sec.

For golang/go#48945.

Change-Id: I63fa70938049534b8f87d30935179d99a2e8f2f6
Reviewed-on: https://go-review.googlesource.com/c/build/+/461775
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@prattmic
Copy link
Member

prattmic commented Jan 12, 2023

For reference, buildlet wait times (get_buildlet, seconds):

Before AWS switch (2022-10-15 through 2022-11-29):

Builder p10 p50 p90 p99
darwin-amd64-10_14 0.035087506 57.426122709 5409.986448872 27029.738440022
darwin-amd64-10_15 0.035315248 51.394408839 6330.141066737 33205.196756675
darwin-amd64-11_0 0.03502191 17.052950941 1265.639361792 18768.733989156
darwin-amd64-12_0 0.036329005 113.757215006 27884.264082743 60902.939780661
darwin-amd64-nocgo 0.036361308 152.538390709 33514.425559542 60922.292660581

After AWS switch (2022-11-30 through 2023-01-12):

Builder p10 p50 p90 p99
darwin-amd64-10_14 0.030347397 129.170084819 16868.565147715 51138.059929517
darwin-amd64-10_15 0.032052233 165.560549426 21411.865571222 53629.770950752
darwin-amd64-11_0 0.03205239 221.267568342 26238.450715285 64911.760616257
darwin-amd64-12_0 0.03138795 175.994555331 25429.052805333 67438.227785737
darwin-amd64-13 0.038021979 4120.744229468 370565.129423636 429429.994737281
darwin-amd64-nocgo 0.031963242 294.516531008 36972.023496533 66837.479161523

Time running tests (make_and_test, seconds):

Before:

Builder p10 p50 p90 p99
darwin-amd64-10_14 1109.337714283 1338.203048197 1438.635656949 1496.908947744
darwin-amd64-10_15 1142.797376231 1345.846128948 1442.504664356 1515.9424281
darwin-amd64-11_0 1321.768913047 1532.862195206 1646.776306831 1738.837314762
darwin-amd64-12_0 1224.030957425 1502.473005047 1706.57681256 1789.397041412
darwin-amd64-nocgo 832.869971342 1125.187291301 1314.866512538 1381.113436051

After:

Builder p10 p50 p90 p99
darwin-amd64-10_14 1026.817217975 1941.302826298 2523.380160588 2841.733411426
darwin-amd64-10_15 1050.692744331 1811.730996635 2426.353868054 2724.949244972
darwin-amd64-11_0 2188.55871373 2517.004398094 2960.169886024 3044.606037487
darwin-amd64-12_0 1949.949176643 2486.049366062 2748.16159951 2832.066993053
darwin-amd64-13 2200.732157566 2933.495288619 3134.77419223 3351.202110408
darwin-amd64-nocgo 1463.312098452 1966.804602269 2125.788129161 2187.764026593

Edit: these are for the Go repo only, not subrepos.

@cagedmantis
Copy link
Contributor

Should we close this issue and mark it as completed? Any additional problems we find can be addressed in more specific issues if necessary.

@heschi
Copy link
Contributor

heschi commented Feb 28, 2023

Closing this since it doesn't seem like there's anything in particular left.

@heschi heschi closed this as completed Feb 28, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Release Feb 28, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/484746 mentions this issue: Revert "env/darwin/aws: reduce guest CPU count to 4"

gopherbot pushed a commit to golang/build that referenced this issue Apr 14, 2023
This reverts CL 461775. That CL failed to update cmd/runqemubuildlet,
which is used for production VM execution. Thus it had no impact. The
next CL will change memory, so just revert this for now rather than
fixing to avoid changing two things at once.

For golang/go#48945.
For golang/go#59634.

Change-Id: I9f1e26f2b3ffd6caca9683242a44597b7f1d8d58
Reviewed-on: https://go-review.googlesource.com/c/build/+/484746
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Commit-Queue: Michael Pratt <mpratt@google.com>
@golang golang locked and limited conversation to collaborators Apr 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Builders x/build issues (builders, bots, dashboards) FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
Archived in project
Development

No branches or pull requests

5 participants