
WIP/RFC: "Coalesced" process batching #15648

Draft · wants to merge 7 commits into main
Conversation

@thejcannon (Member) commented May 25, 2022

!! Opening draft PR as a WIP and RFC !!

This PR introduces one major change, with many small plumbing changes attached. The ultimate goal is to populate the cache per-file when running a process, for maximum cacheability, while still running processes on batches of files for performance.

The result is that ./pants fmt :: will have a cache entry per tool per file, so running ./pants fmt path/to/file.py immediately afterwards is fully cached.

We accomplish this by:

  • Adding a new option experimental_coalesced_process_batching, defaulting to False, so that users can opt in. This is important because we're blurring the line between what actually ran and what gets cached, which many users might not want, especially in CI.
  • Plugins can opt into support by constructing a MaybeCoalescedProcessBatch (which is just a CoalescedProcessBatch). This type is made up of a set of "common" process args shared between all to-be-coalesced processes (like env) and a mapping from filename to a new SandboxInfo type. The new type holds the input/output info, and should be instantiated for a file using the values that would be used if the user had specified only that single file for the goal (see the sketch after this list).
    • The "maybe" here means we pass the batch through to the engine if the experimental option is True.
    • Otherwise, we merge all the little per-file infos into one process and run it as normal.
    • My hope here is that the overhead introduced by plugin authors writing unconditional code to create the little sandbox digests is negligible. In exchange, we gain a simple interface for conditionally opting into the new behavior.
  • The engine takes the new CoalescedProcessBatch and:
    • At the cache level, constructs a synthetic Process object per file, used just for cache-key calculation. Uncached process runs get collected and passed through to be run as one coalesced process batch.
    • At the actual run-on-metal layer, merges the argv and InputDigests to make one coalesced process.
    • If the process succeeds, cache using the synthetic process keys. For now, throw away stdout and stderr (the process was successful, so this is likely a stomachable tradeoff). The resulting process digest will be filtered to each file's relevant output_files and output_directories.
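
To make the shapes concrete, here is a minimal sketch of what the plugin-facing types described above might look like. All field names here are illustrative guesses based on this description, not the exact definitions in the diff:

```python
from __future__ import annotations

from dataclasses import dataclass

from pants.engine.fs import Digest
from pants.util.frozendict import FrozenDict


@dataclass(frozen=True)
class SandboxInfo:
    """Per-file input/output info, instantiated with the values that would
    be used if the user had specified only this one file for the goal."""

    input_digest: Digest
    output_files: tuple[str, ...]
    output_directories: tuple[str, ...]


@dataclass(frozen=True)
class CoalescedProcessBatch:
    """The "common" process args shared between all to-be-coalesced
    processes, plus a mapping from filename to that file's SandboxInfo."""

    argv: tuple[str, ...]
    env: FrozenDict[str, str]
    description: str
    sandboxes: FrozenDict[str, SandboxInfo]  # filename -> SandboxInfo


# Per the description above, a MaybeCoalescedProcessBatch "is just a
# CoalescedProcessBatch"; the "maybe" is resolved when the batch is run.
MaybeCoalescedProcessBatch = CoalescedProcessBatch
```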

TODOs:

  • Cache storing today just uses the big-process result. This is not ideal.
  • For Python, we need to provide a mechanism to turn a CoalescedVenvPexProcessBatch into the associated type
  • Testing 😄
  • Need to measure this claim: "My hope here is the overhead introduced by plugin authors writing unconditional code to create little sandbox digests is negligible."
  • There's a lot of duplication in the engine between vanilla processes and the new types, especially around "lift"ing the new types from Python to Rust.
  • Should we ensure there's no overlap in SandboxInfo's output_files? I think so (see the fail-fast sketch after this list).
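
For that last TODO, the fail-fast check could be as simple as the following (a hypothetical helper, reusing the illustrative types sketched earlier):

```python
from collections import Counter


def check_no_output_overlap(batch: CoalescedProcessBatch) -> None:
    """Hypothetical validation: two batch members claiming the same output
    path would make filtering the result digest per file ambiguous."""
    counts = Counter(
        path
        for info in batch.sandboxes.values()
        for path in (*info.output_files, *info.output_directories)
    )
    duplicates = sorted(path for path, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Overlapping outputs across batch members: {duplicates}")
```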

@thejcannon thejcannon marked this pull request as draft May 25, 2022 15:25
@thejcannon (Member Author) commented:

I won't label this because it isn't worth wasting CI time 😛

@stuhood (Member) left a comment:

Thanks for sketching this out! I do think that the API is relatively simple.

My next question is just whether enough tools will be able to benefit from this to justify the complexity. It primarily impacts formatters and potentially (batched) dependency inference.

Need to measure this claim: "My hope here is the overhead introduced by plugin authors writing unconditional code to create little sandbox digests is negligible."

It looks like it is when compared to code which is 1) already batching, 2) doesn't use dependencies... which is great. It's likely to be more awkward in cases where a tool uses dependencies (since you must do independent graph walks per root). But that is effectively identical to what recursive @rules (generally: compilers) need to do, so not too weird.

Should we ensure there's no overlap in SandboxInfo's output_files? I think so

Yea, probably: overlap would always be a bug.

Comment on lines +344 to +345:

```python
    if use_coalesced_process_batch:
        return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
```

Member:
From a safety perspective, it might be the case that when batching is disabled, we should instead run file-at-a-time (as that's the easiest way to flush out bugs in having fully declared the dependencies of a particular sub-process). For example: if a file necessary for process A is only included by subprocess B, then it will be missing from A's cache key, but we will not detect that, because when they do not hit the cache, they will run together here.

Although I suppose that you might still see that bug after a partial-cache hit where you hit the cache for B but not A, and then fail when you try to run A independently...?

Member:

Actually... hm. Either batched or unbatched could expose bugs, since both the positive and negative case of a file existing might matter.

@thejcannon (Member Author):

The hope here is that with the config disabled, we exhibit the batching behavior we have today. I think per-file would be a worse experience (which is why we removed it in favor of batching).

That being said, I don't have a good way to find bugs other than some clever testing 🤔

@thejcannon (Member Author):

My next question is just whether enough tools will be able to benefit from this to justify the complexity. It primarily impacts formatters and potentially (batched) dependency inference.

I think formatters, linters, and batched dep inference are all excellent candidates, as the linchpin here is "can we get away with no stdout/stderr on success".

Comment on lines +339 to +343:

```python
@rule
async def run_maybe_coalesced_process_batch(
    request: MaybeCoalescedProcessBatch,
    use_coalesced_process_batch: ExperimentalCoalescedProcessBatchingOption,
) -> FallibleProcessResult:
```

@thejcannon (Member Author):

I'm actually thinking this might be feasible as a @rule_helper, and if we strike it just right, it can "do the right thing": e.g. do the normal mega-await in the normal case, or the mini-awaits in the split case. Something like the sketch below.
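
For illustration, that helper might look something like this (a sketch only; the names and the merge step are hypothetical, not code from this diff):

```python
from pants.engine.process import FallibleProcessResult, Process
from pants.engine.rules import Get, rule_helper


@rule_helper
async def _run_maybe_coalesced(
    request: MaybeCoalescedProcessBatch,
    use_coalesced_process_batch: bool,
) -> FallibleProcessResult:
    if use_coalesced_process_batch:
        # Split case: the engine computes per-file cache keys and runs the
        # uncached remainder as one coalesced batch (the "mini-awaits").
        return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
    # Normal case: merge the per-file infos into one vanilla Process and do
    # the usual "mega-await", exactly as today.
    merged: Process = _merge_into_single_process(request)  # hypothetical helper
    return await Get(FallibleProcessResult, Process, merged)
```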

@benjyw (Contributor) commented Jul 6, 2022

Hey, sorry for the very tardy comments on this. I definitely think this is an important direction to go in. Users may want to make different performance vs. theoretical-correctness tradeoffs, and this is the kind of thing that will support that.

Regarding the implementation though, I'm not sure yet if I think a synthetic Process object is the way to go. It seems potentially troublesome to cache a Process as if it ran when it didn't.

The direction I was imagining we could go here is a little different: differentiate "processes" from "facts", where what we cache is "facts".

One obvious example of a fact is "running this Process produces this output". Those are the only facts we cache today. But there can be other facts. For example, "foobar.py passes this linter with this config". We might be able to establish (and cache) that fact by running a Process on a batch of files that includes foobar.py.

This may seem like hair-splitting on naming, since the synthetic Process is basically the fact abstraction. But it's not entirely: one example of a proper semantic difference is that a fact can selectively ignore Process fields that don't matter to the establishment of the fact (such as the value of $PATH).

Now I will grant that, done without sufficient care, this opens the door to underspecified cache keys, since it would be easy to omit from the fact key a Process field that does turn out to matter. So maybe a fact is tightly bound to a Process after all, or at least by default. I could be convinced. In which case this is mostly about naming.

But even if this is just naming, naming matters. We do want to create a clear distinction between "a process that actually ran" and "some information the user cares about that we learned from that run".
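
A minimal sketch of that distinction (entirely hypothetical types; nothing here is in the diff):

```python
from dataclasses import dataclass, field

from pants.engine.fs import Digest


@dataclass(frozen=True)
class Fact:
    """Some information we learned and want to cache, e.g. 'foobar.py
    passes this linter with this config'."""

    # The cache key: derived from a Process, but in principle free to
    # ignore fields that don't affect the fact (e.g. the value of $PATH).
    key: bytes
    output_digest: Digest
    # Annotation only, deliberately excluded from equality (and thus from
    # the key): whether a Process with exactly this description actually
    # ran, or the fact was established by a batched run with other files.
    established_by_actual_run: bool = field(compare=False)
```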

@stuhood (Member) commented Jul 6, 2022

This may seem like hair-splitting on naming, since the synthetic Process is basically the fact abstraction. But it's not entirely: one example of a proper semantic difference is that a fact can selectively ignore Process fields that don't matter to the establishment of the fact (such as the value of $PATH).

Given that @Eric-Arellano will be starting on #12203 soon, which will definitely be adjusting the Process abstraction to allow more flexibility in this area, sticking with the Process abstraction so that multiple codepaths can take advantage of it makes the most sense to me.

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

@benjyw (Contributor) commented Jul 6, 2022

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

That's fair, but OTOH caching lies ("this process ran with these outputs") also seems like a recipe for hard-to-debug errors.

At the very least, anything we cache should be annotated with whether it was an actual Process that ran or a synthetic one that was generated from the results of some other processes (I say "annotated" because obviously that field can't be part of the cache key). And at that point, what I'm suggesting by way of that annotation is that we name the thing we cache appropriately: even if it wraps a full Process object for safety, it isn't really the same as one, semantically.

@stuhood (Member) commented Jul 6, 2022

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

That's fair, but OTOH caching lies ("this process ran with these outputs") also seems like a recipe for hard-to-debug errors.

At the very least, anything we cache should be annotated with whether it was an actual Process that ran or a synthetic one that was generated from the results of some other processes (I say "annotated" because obviously that field can't be part of the cache key). And at that point, what I'm suggesting by way of that annotation is that we name the thing we cache appropriately: even if it wraps a full Process object for safety, it isn't really the same as one, semantically.

Perhaps. But that "do not let a coalesced Process collide with a normal Process" can be accomplished by adding an environment variable or some other basic marker to a coalesced Process, rather than by using a separate cache API.
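
Concretely, that marker could be as simple as the following (a hypothetical sketch; the env var name and helper are made up, not code from this PR):

```python
from pants.engine.process import Process


def as_synthetic_cache_entry(per_file: Process) -> Process:
    """Hypothetical: tag a synthetic per-file Process with a marker env var
    so its cache key can never collide with an identical Process that a
    user constructed and ran directly."""
    return Process(
        argv=per_file.argv,
        description=f"(synthetic batch member) {per_file.description}",
        env={**per_file.env, "__PANTS_COALESCED_BATCH_MEMBER__": "1"},
        input_digest=per_file.input_digest,
        output_files=per_file.output_files,
        output_directories=per_file.output_directories,
    )
```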
