Proposal: Overhaul command line args for choosing a spawn strategy #4153

mmorearty · 2017-11-22T21:41:38Z

This is a proposal for a pretty big change to the command line arguments that are used to choose a build strategy. Please consider it to be a design doc of sorts. It's a strawman; probably needs improvement. It's based on our experiences at Asana, and a number of complex scenarios we have run into.

Introduction

There are currently a number of command line arguments that are used to choose which spawn strategy (e.g. standalone, sandboxed, worker, remote) is used for a given action. Each of these command line arguments serves a specific need; but taken as a whole, they (a) are somewhat difficult to learn, and (b) make it difficult to express some commonly desired workflows.

Here are some (all?) of the current flags:

--spawn_strategy=...: specify the overall default strategy
--genrule_strategy=...: specify the strategy to use for genrules, if different from spawn_strategy
--strategy=<mnemonic>=...: specify an override, the strategy to use with a specific mnemonic
--worker_sandboxing: doesn't actually choose a spawn strategy, but it's related — turns on sandboxing for persistent workers

Also, currently the strategy names local and standalone are synonyms. Apparently local is the old, deprecated name, and standalone is the preferred name meaning "a regular, local, non-sandboxed, non-worker build." In this document, if I use the word "local", I mean "any action that is running on the same machine that Bazel is running on." In other words, "local" is the opposite of "remote"; but in my terminology, a local build might be standalone or sandboxed or worker.

Some current problems

If you want to use a remote cache, and fall back to a local build (--spawn_strategy=remote --remote_rest_cache=...), the local build will always be standalone. There is no way to tell Bazel to fall back to worker or sandboxed.
If you specify worker but, for whatever reason, the action can't be run in worker, it always falls back to standalone. No way to specify sandboxed.
Worker sandboxing (--worker_sandboxing) is separate from other sandboxing (--spawn_strategy=sandboxed). This could lead to lack of clarity on the part of the developer: They might think they have turned on safe, fully sandboxed builds with --spawn_strategy=sandboxed, when in fact anything they run in a worker isn't actually safe.
In more complex scenarios, switching between sandboxed and standalone can require numerous changes to the command line. For example, as mentioned above, if you are using workers then it might require changing both the --spawn_strategy and the --worker_sandboxing arguments. Or, suppose you usually use remote, but you have several specific actions that you don't want to be cached remotely. So you might have a bazelrc file that looks like this:
```
  build --spawn_strategy=remote
  build --strategy=SpecialCase1=standalone
  build --strategy=SpecialCase2=standalone
  build --strategy=SpecialCase3=standalone
  build --strategy=SpecialCase4=standalone
  build --strategy=SpecialCase5=standalone
```
Switching that to use sandboxing will require changing five lines. Putting standalone did not actually express your goal; your goal was actually "do the build locally, somehow; don't use remote."
Finally, a different kind of problem: Today's command-line flags can take a long time to get comfortable with. For example, the difference between --spawn_strategy and --strategy; questions such as "if an item is not in the cache, what sort of local build does it fall back to, and can I control that" (I learned by reading the source); and so on.

Strategy selection goals

I feel that it would be desirable for the strategy-related command line arguments had the following characteristics:

A flag can apply to either all mnemonics, or to only one. For example, eliminate the difference between --spawn_strategy and --strategy.
A flag expresses one characteristic of the strategy selection process; it does not entirely define what strategy to use. For example, "when doing any local build, use sandboxing if possible" (separate from specifying whether to use a worker), or "if the remote cache does not contain the artifacts, do some kind of local build" (without defining what kind).
The flags are a little easier for a newcomer to get a handle in, in terms of the relationships between them.

It would then be up to Bazel to take this set of guidelines, and resolve it to an appropriate build action.

Examples

A complex scenario might play out like this:

The args say to allow downloads from the remote cache (maybe with just --remote_rest_cache=..., no more args needed). So Bazel queries the remote cache, but does not find the build artifacts.
The args say local builds are allowed (--remote_local_fallback, on by default). So Bazel needs to decide which local strategy to use.
The args say to allow persistent workers for any actions that support them (maybe --workers or something). But this action does not.
The args say to use sandboxing when possible (maybe --sandboxing). So we do a local, sandboxed build.
The args say that locally built, sandboxed artifacts are safe to upload to the remote cache (maybe --remote_upload_sandboxed_results). So Bazel uploads the build artifacts.

Another scenario, this time with a non-sandboxed local build:

same as above
same as above
same as above
The args allow sandboxing for most actions, but not this one (maybe --sandboxing plus --nosandboxing=ThisMnemonic). So, no sandboxing.
Bazel does a standalone build.
The args say that locally built, sandboxed artifacts are safe to upload to the remote cache, but un-sandboxed local builds are not safe (maybe --noremote_upload_standalone_results; the default). So Bazel does not upload the build artifacts.

Another scenario, with no remote cache, and with a worker:

No remote cache is specified. So, we will do a local build.
The args say to allow persistent workers for any actions that support them (maybe --workers or something). This action does support workers. So, great.
The args say to use sandboxing when available (--sandboxing). So, we do the build in a sandboxed worker.

Strategy selection algorithm

The strategy selection approach is pretty clearly indicated by the above examples. To spell it out:

If there is a remote cache, check to see if the artifacts are there already. they are, we're done. Otherwise, continue to step 2.
Choose between remote or local execution. If remote execution is available, choose that, and we're done. Otherwise choose local, and continue to step 3.
Choose worker vs. non-worker.
Choose sandboxed vs. non-sandboxed.
Do the local build. At this point we are doing some combination of (worker/non-worker) + (sandboxed/non-sandboxed).
After the build, if there is not a remote cache, we're done. Otherwise, continue to step 7.
If the build was sandboxed, and sandboxed artifacts are allowed to upload to the remote cache, then upload.
If the build was non-sandboxed, and non-sandboxed artifacts are allowed to upload to the remote cache, then upload.

Suggested flags

Although I'm happy with the spirit of the following suggestions, I'm not sure if these are the best names for the flags.

Remote execution:

I don't really know; I don't have any experience with remote execution. But I think --remote_executor, with nothing more, would imply that remote execution should be used.

Remote caching:

--remote_rest_cache=...: Same as today. Implies allowing downloads from the remote cache, and falling back to a local build.
--[no]remote_upload_sandboxed_results (default true) and --[no]remote_upload_nonsandboxed_results (default false). Note, it says nonsandboxed instead of standalone, because e.g. non-sandboxed worker builds should fit in the remote_upload_nonsandboxed_results category.
--[no]remote_upload_worker_results (default true). (As an aside, It doesn't make any sense to add a --[no]remote_upload_nonworker_results, because that is implied by the presence of --remote_rest_cache.)

With those defaults for remote-cache uploads, a sandboxed worker build would upload; a non-sandboxed worker build would not.

Sandboxing:

--[no]sandboxing: This would mean: "For all mnemonics: If we are doing a local build, and if sandboxes are supported, do it in a sandbox." Or with "no" prefix, never use a sandbox. Applies to local worker worker in addition to local non-worker actions.
--[no]sandboxing=<mnemonic>: Same, but for a specific mnemonic.

(Open issue: Maybe it should be --strategy_sandboxing instead of --sandboxing)

Workers:

--[no]workers: This would mean: "For all mnemonics: If we are doing a local build, and if persistent workers are supported, do it in a persistent worker." Or with "no" prefix, never use a persistent worker.
--[no]workers=<mnemonic>: Same, but for a specific mnemonic.
--[no]worker_sandboxing: Whether to allow worker+sandbox. It's important to default to true (the opposite of today's default) in order to make the model consistent and easy to undertstand.

(Open issue: Maybe it should be --strategy_workers instead of --workers. But that would not affect --worker_sandboxing, which is in a separate option group.)

No need for strategy-specific "fallback" strategies

One result of all of the above is that the list of strategies now becomes a "flat" list, with no need for one strategy to know about any others. (The worker-with-sandboxing scenario is a little subtle; but that is basically just the worker strategy with an option for whether to use sandboxing.)

As the code is written today, many strategies -- remote, worker, and sandbox -- need a "fallback" capability right in the runner code (and they usually fall back to standalone). With this new approach, the selection process would happen before we get to the runner; the runner can just do its work.

Not only does this make the code simpler, but it also (as described extensively above) simplifies the mental model for the user. There is no need to ask questions like "what does a remote build fall back to"; it just falls back to an appropriate kind of local build.

In fact, to some degree, the concept of "strategies" becomes less important to the end user. To the user, it's more of just a multidimensional selection process of different characteristics of the build (remote execution yes/no, remote cache yes/no, worker yes/no, sandboxed yes/no). It's a little easier (for me anyway) to think of it that way than to think of it as "I'm using the remote strategy, which fell back to worker" or "I'm using the worker strategy, which fell back to standalone".

The "problem" scenarios from above

With these new flags, here is how the problem scenarios from above would be handled:

If you want to use a remote cache, and fall back to a local build, just use --remote_rest_cache=...; the rest of the selection will fall out naturally from other flags.
If you specify worker but, for whatever reason, the action can't be run in worker, it will fall back to sandboxed if that has been specified and is available.
Worker sandboxing is (by default) separate from other sandboxing. So if you turn on sandboxing, you have turned it on for everything.
Switching between sandboxed and standalone now just requires --[no]sandboxing, and nothing more.

The text was updated successfully, but these errors were encountered:

meteorcloudy · 2017-11-23T09:06:13Z

Wow, this looks like a very nice suggestion, thank you!
But I'm not the expert here, FYI, @philwo @buchgr

philwo · 2017-11-23T15:41:15Z

Thanks a lot! I'll read and think through it in the next days. :)

ulfjack · 2017-11-23T18:21:27Z

There is another dimension which you didn't take into account. In particular, we allow action 'tags' to enable or disable specific features. For example, an action marked as 'nosandbox' can disable sandboxing for that action, and an action marked as 'worker' enables persistent workers for that action.

My plan was to have one flag to specify strategy order and fallback, and another flag to add or remove tags from actions identified by mnemonic. (And of course, per-strategy flags to configure the strategy.) One difference to your proposal is that my proposal does not unify worker sandboxing and sandboxing.

TODO: I need to write down a more complete description.

ittaiz · 2017-11-24T06:40:40Z

Two use-cases I'd like to add here are: 1. Trying to read from a remote cache and falling back to "local remote executor" (LRE). This serves a few purposes. It enables network isolation on OS X, more closely aligns one with remote execution and sidesteps sandboxing issues which don't exist in remote execution. 2. Another use-case which is related to the above is the ability to decide on remote execution for different filters, in my case based on the test tags. This is to enable to run fast unit-tests locally with sandboxing (since they're usually very isolated from the environment) and to run integration tests in the LRE. #2 might be a bit disconnected from this issue but I think #1 is definitely related here

…

On Thu, 23 Nov 2017 at 20:21 Ulf Adams ***@***.***> wrote: There is another dimension which you didn't take into account. In particular, we allow action 'tags' to enable or disable specific features. For example, an action marked as 'nosandbox' can disable sandboxing for that action, and an action marked as 'worker' enables persistent workers for that action. My plan was to have one flag to specify strategy order and fallback, and another flag to add or remove tags from actions identified by mnemonic. (And of course, per-strategy flags to configure the strategy.) One difference to your proposal is that my proposal does not unify worker sandboxing and sandboxing. TODO: I need to write down a more complete description. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#4153 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABUIFxSaFBSoq30dIGYhCTXNF9mdW_Y7ks5s5beqgaJpZM4QoCBw> .

mmorearty · 2017-11-24T07:43:50Z

Ulf, you're right that I neglected to mention tags. As far as I know, the currently available tags could be thought of as constraints on the set of available strategies for that action (e.g. "workers are allowed" or "sandboxing is not allowed"). I was picturing it working as follows: Let's say the command line flags include "--sandbox" or something, meaning to use sandboxing when possible, but a specific action has "nosandbox," meaning that rule does not allow sandboxing — then that rule would just run non-sandboxed. (As an aside, I didn't know about "nosandbox", because it is not in the documentation. I'll add an issue asking that it be added to the docs. When you say "worker" are you referring to the "supports-workers" value with an action's "execution_requirements"? That's all I know about.) Regarding unifying of worker sandboxing and non-worker sandboxing, that's one part of my proposal that I'm a little hesitant about. From a user perspective that feels like the right thing to do — if a user wants sandboxing then they probably want it everywhere, or, at a minimum, they don't want to be surprised to learn that even though they thought they had turned sandboxing on, it was still off for some of their rules. From an implementation perspective, I haven't looked at the code but I wouldn't be surprised if worker sandboxing and non-worker sandboxing are quite different beasts, so it might feel wrong to put them under the same umbrella. The thing about avoiding "surprise" feels pretty important to me, though. Strategy order and fallback might be a reasonable way to do things. I know it's a bit unusual for me to write a super-long proposal out of the blue, especially since it's unlikely that I would be doing the work; feel free to ignore whatever you want to ignore, but I do hope that the use cases I spelled out are informative, and that you consider them if you work on a strategy-order-and-fallback set of command line flags. My motivations for writing this were that (a) it took us a long time to get comfortable and familiar with the nuances of strategy selection, and (b) we keep running into scenarios that are either messy or impossible with the current set of flags. (Having the remote cache fall back to a sandboxed build was a very high priority for us, so I got that working in our fork in a slightly hacky way, but I'd love not to have our own fork.) Thanks! - Mike

…

On Thu, Nov 23, 2017 at 10:21 AM, Ulf Adams ***@***.***> wrote: There is another dimension which you didn't take into account. In particular, we allow action 'tags' to enable or disable specific features. For example, an action marked as 'nosandbox' can disable sandboxing for that action, and an action marked as 'worker' enables persistent workers for that action. My plan was to have one flag to specify strategy order and fallback, and another flag to add or remove tags from actions identified by mnemonic. (And of course, per-strategy flags to configure the strategy.) One difference to your proposal is that my proposal does not unify worker sandboxing and sandboxing. TODO: I need to write down a more complete description. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#4153 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAMrN6IqwSzz7WbN0X9loHaKopUcvFphks5s5bexgaJpZM4QoCBw> .

mmorearty · 2017-11-24T07:54:06Z

One more point about strategies, as food for thought: I think that for those of you who are the actual developers on Bazel, it's quite natural to think in terms of strategies. From the perspective of the Bazel source code, strategies are an important abstraction — they define what needs to happen in order to get the build artifacts. But from a user's perspective, I could see a possible world where strategies are not an essential concept for the user to think about. It's obvious to the user, for example, that if you can't find something in the cache then you'll have to run a compiler; it's obvious that sandboxing will involve doing some setup stuff before running the compiler, and some cleanup after; and so on. The way I think of how Bazel runs one action is: "It will look in the remote cache. If it can't find what it needs, it will run a build tool somewhere. There will be some qualifications on how it runs that build tool: is it sandboxed, will it use a worker, etc." Strategies don't come into my mental model (or at least, they didn't until I got much more familiar with Bazel). That's why, in a strategy-centered view of the world, it feels a little unnatural to unify worker sandboxing and non-worker sandboxing: From a strategy perspective, those are just different things. But viewed a different way, they are more closely related.

…

On Thu, Nov 23, 2017 at 11:43 PM, Mike Morearty ***@***.***> wrote: Ulf, you're right that I neglected to mention tags. As far as I know, the currently available tags could be thought of as constraints on the set of available strategies for that action (e.g. "workers are allowed" or "sandboxing is not allowed"). I was picturing it working as follows: Let's say the command line flags include "--sandbox" or something, meaning to use sandboxing when possible, but a specific action has "nosandbox," meaning that rule does not allow sandboxing — then that rule would just run non-sandboxed. (As an aside, I didn't know about "nosandbox", because it is not in the documentation. I'll add an issue asking that it be added to the docs. When you say "worker" are you referring to the "supports-workers" value with an action's "execution_requirements"? That's all I know about.) Regarding unifying of worker sandboxing and non-worker sandboxing, that's one part of my proposal that I'm a little hesitant about. From a user perspective that feels like the right thing to do — if a user wants sandboxing then they probably want it everywhere, or, at a minimum, they don't want to be surprised to learn that even though they thought they had turned sandboxing on, it was still off for some of their rules. From an implementation perspective, I haven't looked at the code but I wouldn't be surprised if worker sandboxing and non-worker sandboxing are quite different beasts, so it might feel wrong to put them under the same umbrella. The thing about avoiding "surprise" feels pretty important to me, though. Strategy order and fallback might be a reasonable way to do things. I know it's a bit unusual for me to write a super-long proposal out of the blue, especially since it's unlikely that I would be doing the work; feel free to ignore whatever you want to ignore, but I do hope that the use cases I spelled out are informative, and that you consider them if you work on a strategy-order-and-fallback set of command line flags. My motivations for writing this were that (a) it took us a long time to get comfortable and familiar with the nuances of strategy selection, and (b) we keep running into scenarios that are either messy or impossible with the current set of flags. (Having the remote cache fall back to a sandboxed build was a very high priority for us, so I got that working in our fork in a slightly hacky way, but I'd love not to have our own fork.) Thanks! - Mike On Thu, Nov 23, 2017 at 10:21 AM, Ulf Adams ***@***.***> wrote: > There is another dimension which you didn't take into account. In > particular, we allow action 'tags' to enable or disable specific features. > For example, an action marked as 'nosandbox' can disable sandboxing for > that action, and an action marked as 'worker' enables persistent workers > for that action. > > My plan was to have one flag to specify strategy order and fallback, and > another flag to add or remove tags from actions identified by mnemonic. > (And of course, per-strategy flags to configure the strategy.) One > difference to your proposal is that my proposal does not unify worker > sandboxing and sandboxing. > > TODO: I need to write down a more complete description. > > — > You are receiving this because you authored the thread. > Reply to this email directly, view it on GitHub > <#4153 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AAMrN6IqwSzz7WbN0X9loHaKopUcvFphks5s5bexgaJpZM4QoCBw> > . >

ittaiz · 2017-11-24T08:21:43Z

that if you can't find something in the cache then you'll have
to run a compiler

big +1

ulfjack · 2017-11-24T09:59:15Z

Ha! I disagree that it's natural to think in terms of strategies. We've actually had a lot of problems in the past, and I've independently come to the conclusion that we'll need to redo the way this is currently done.

In general, I agree with the points you're making, but I wanted to point out that the proposal is incomplete, and there are a few additional use cases we need to cover.

- remove BaseSpawn.Local; instead, all callers pass in the full set of execution requirements they want to set - disable caching and sandboxing for the symlink tree action - it does not declare outputs, so it can't be cached or sandboxed (fixes #4041) - centralize the existing execution requirements in the ExecutionRequirements class - centralize checking for execution requirements in the Spawn class (it's possible that we may need a more decentralized, extensible design in the future, but for now having them in a single place is simple and effective) - update the documentation - forward the relevant tags to execution requirements in TargetUtils (progress on #3960) - this also contributes to #4153 PiperOrigin-RevId: 177288598

- remove BaseSpawn.Local; instead, all callers pass in the full set of execution requirements they want to set - disable caching and sandboxing for the symlink tree action - it does not declare outputs, so it can't be cached or sandboxed (fixes bazelbuild#4041) - centralize the existing execution requirements in the ExecutionRequirements class - centralize checking for execution requirements in the Spawn class (it's possible that we may need a more decentralized, extensible design in the future, but for now having them in a single place is simple and effective) - update the documentation - forward the relevant tags to execution requirements in TargetUtils (progress on bazelbuild#3960) - this also contributes to bazelbuild#4153 PiperOrigin-RevId: 177288598

jin · 2019-09-03T15:48:59Z

cc @buchgr @ishikhman with https://blog.bazel.build/2019/06/19/list-strategy.html and other work on strategy selection recently, is this issue resolved?

ishikhman · 2019-09-04T12:55:39Z

Yes, I think so. It seems that it's been covered by #7480

Closing the issue for now.

meteorcloudy added P2 We'll consider working on this in future. (Assignee optional) type: feature request labels Nov 23, 2017

mmorearty mentioned this issue Nov 24, 2017

Document the "nosandbox" execution_requirement on actions bazelbuild/bazel-website#79

Closed

philwo added the category: misc > misc label Dec 5, 2017

jin added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Sep 3, 2019

ishikhman closed this as completed Sep 4, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Overhaul command line args for choosing a spawn strategy #4153

Proposal: Overhaul command line args for choosing a spawn strategy #4153

mmorearty commented Nov 22, 2017

meteorcloudy commented Nov 23, 2017

philwo commented Nov 23, 2017

ulfjack commented Nov 23, 2017

ittaiz commented Nov 24, 2017 via email

mmorearty commented Nov 24, 2017 via email

mmorearty commented Nov 24, 2017 via email

ittaiz commented Nov 24, 2017 •

edited

Loading

ulfjack commented Nov 24, 2017

jin commented Sep 3, 2019

ishikhman commented Sep 4, 2019

Proposal: Overhaul command line args for choosing a spawn strategy #4153

Proposal: Overhaul command line args for choosing a spawn strategy #4153

Comments

mmorearty commented Nov 22, 2017

Introduction

Some current problems

Strategy selection goals

Examples

Strategy selection algorithm

Suggested flags

No need for strategy-specific "fallback" strategies

The "problem" scenarios from above

meteorcloudy commented Nov 23, 2017

philwo commented Nov 23, 2017

ulfjack commented Nov 23, 2017

ittaiz commented Nov 24, 2017 via email

mmorearty commented Nov 24, 2017 via email

mmorearty commented Nov 24, 2017 via email

ittaiz commented Nov 24, 2017 • edited Loading

ulfjack commented Nov 24, 2017

jin commented Sep 3, 2019

ishikhman commented Sep 4, 2019

ittaiz commented Nov 24, 2017 •

edited

Loading