-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/playground: use gVisor on linux/amd64 binaries instead of NaCl for sandboxing #25224
Comments
This has come up again as attempting to upgrade the playground to 1.11 results in the following breakage during the NaCl time patching phase:
/cc @bradfitz |
That's @bcmills's patch. The old patch (enable-fake-time.patch) was designed to basically never bit rot but the new patch seems to have more context that makes it fragile. We should really upstream this and have it behind a bool or build tag so it's less likely to rot. |
I've been playing around with this. I'm deciding between:
I think I'm leaning towards (2). As part of this, we'd migrate to (2) well before the Go 1.14 release (which drops nacl), so the release process doesn't involve changing playground machinery during a release. We'd still support Go 1.13 nacl mode with the new infrastructure, and then the Go 1.14 release would just be changing a single line in its Dockerfile. |
@bradfitz Have you investigated using Cloud Run? The documentation seems to suggest that the runtime user is root: https://cloud.google.com/run/docs/tips#container-security |
@toothrot, Cloud Run is already under gVisor itself. (and it doesn't let us tweak its sandbox config to be more restrictive, as we'd need). Cloud Run gives you root, but fake-ish constrained root. We definitely couldn't use gVisor's KVM mode ("platform") under it. We could probably just the ptrace mode (ptrace syscall is at least partially supported: https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/). But once runsc went to do the other root-requiring stuff (setting up namespaces, etc), I suspect we'd trip over unimplemented system calls. e.g. I see on that system call list:
So basically, to use Cloud Run we'd need to do the same work hacking up runsc to run rootless that we'd have to do to run it on its current App Engine Flex environment. That said, we should move many of our things to Cloud Run regardless, but I don't think it'd help us here. @prattmic can correct me if indeed the ptrace mode of runsc would work unmodified under gvisor. That would be a happy surprise. |
@prattmic confirms that nested gVisor doesn't work yet. |
In case we do end up going the path of using some modified root-less runsc directly, without docker, this syzkaller code drives runsc directly: https://github.com/google/syzkaller/blob/master/vm/gvisor/gvisor.go |
Change https://golang.org/cl/195983 mentions this issue: |
For what it's worth, this doesn't seem to work right: gVisor runs see a lot of "Error communicating with remote server." and when it does work, stdout is not visible. https://play.golang.org/p/5_6CpdUjYY1 |
@tv42, it's not fully deployed yet. |
Reopening until it's all deployed, though. |
@toothrot, next up we need to run the frontend in a different network: https://cloud.google.com/appengine/docs/flexible/go/using-shared-vpc Have you done this before? (I haven't.) Because right now it can't contact the internal load balancer. We should test it first at a play-test.golang.org version and verify that it works (and doesn't break, say, memcache connections). |
Change https://golang.org/cl/229418 mentions this issue: |
Change https://golang.org/cl/229677 mentions this issue: |
This change avoids cascading sandbox backend failures to the frontend instances. The timeout TODO is safe to remove as we now pass in the request context for cancellation, and sandboxBuild has its own build timeout. Updates golang/go#25224 Updates golang/go#38530 Change-Id: If892f86bad08c55429b6ebab768b83c5d7621cf1 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229677 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
Change https://golang.org/cl/229679 mentions this issue: |
Change https://golang.org/cl/229680 mentions this issue: |
- Specify the correct image in konlet.yaml. gvisor-playground-sandbox is the child-process container. - Correct interpolation in config identifiers, which is deprecated. - Set min_ready_sec for update policy to not cause an outage when updating - Use name_prefix for instance_template instead of name, which allows updates. Templates are immutable, so previously this was not possible to update. Updates golang/go#38530 Updates golang/go#25224 Change-Id: I3f7618b8e378eaa9714e571b90390b7052bf2855 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229418 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Change https://golang.org/cl/229681 mentions this issue: |
Change https://golang.org/cl/229957 mentions this issue: |
This conforms with a policy change, and should save money with no measurable performance impact. Removes bradfitz's SSH key from instances (that are inaccessible anyway). Updates golang/go#25224 Change-Id: I1733192c98deee1deabf2237ae5fe19edd29ab93 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229957 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
This changes the sandbox VMs to allow them to write to StackDriver metrics. Updates golang/go#25224 Updates golang/go#38530 Change-Id: I82954b8eed3664289f5c69c0f5301a72206f0948 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229681 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
This change adds OpenCensus's HTTP instrumentation to the sandbox. In development mode, it exposes a prometheus metrics interface on /statusz. This is the first in a series of CLs to add instrumentation to different parts of the sandbox to help investigate instability issues. For now, reporting metrics around our responses per handler to StackDriver will be helpful. OpenTelemetry would be preferable, as it is the successor of OpenCensus, however the StackDriver integration is not quite done. Updates golang/go#25224 Updates golang/go#38530 Change-Id: I600fd695bb66c8bee16bc0b778d51930f4cdd476 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229679 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
This change adds a metric to count the number of running containers, according to Docker. Updates golang/go#25224 Updates golang/go#38530 Change-Id: Id989986928dff594cb1de0903a56dcffed8220c4 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229680 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
Change https://golang.org/cl/229958 mentions this issue: |
The generic_task target has a required label format that was not being met. Moves metadata fetching into a helper function. Removes call to view.RegisterExporter for StackDriver exporter, which was unncessary. Updates golang/go#25224 Updates golang/go#38530 Change-Id: Ib009f5ce906f5b9479cdda8c7e8322d06e3036e4 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229958 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
Change https://golang.org/cl/229980 mentions this issue: |
Change https://golang.org/cl/229981 mentions this issue: |
This change measures the latency and success of container creation. These metrics will help capacity planning and investigating production issues. Updates golang/go#25224 Updates golang/go#38530 Change-Id: Id7f373acb8741d4465c6e632badb188b6e855787 Reviewed-on: https://go-review.googlesource.com/c/playground/+/229980 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Creating a container in the sandbox takes 500ms to 1s. The sandbox was creating containers serially, but serving requests in parallel. This means that we can be starved for workers with a trivial number of requests. In addition, the sandbox in production is not CPU bound, meaning we probably have room to do some extra work while serving a request. This CL introduces a worker pool to create containers. It also changes the readyContainer chan to unbuffered to avoid having twice as many containers as we expect while idle (the container waiting to be sent plus those already in the channel's buffer). Updates golang/go#25224 Updates golang/go#38530 Change-Id: I0e535cf65409c3dbf32329577a1c0687c2614a0d Reviewed-on: https://go-review.googlesource.com/c/playground/+/229981 Run-TryBot: Alexander Rakoczy <alex@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org>
Cross-linking this comment mentioning our new sandbox metrics for posterity: #38530 (comment) |
I misunderstood in the referenced issue #38530: you can use custom metrics for autoscaling: https://cloud.google.com/compute/docs/autoscaler/scaling-stackdriver-monitoring-metrics#custom_metrics QPS or container_count could be good candidates. We'd have to update our target to be a GCE instance rather than a generic_task for the metric we choose. |
I think this is effectively finished. |
Native Client deprecation has been announced in favor of Web Assembly. It's unclear what that means in terms of NaCl's future development and support.
I'd like us to investigate using gVisor as our sandbox mechanism for the playground. This is not a statement that we should definitively switch over to gVisor.
https://github.com/google/gvisor
@ysmolsky
The text was updated successfully, but these errors were encountered: