
WIP/RFC: "Coalesced" process batching #15648

Draft · wants to merge 7 commits into main
Conversation

@thejcannon (Member) commented May 25, 2022

!! Opening draft PR as a WIP and RFC !!

This PR introduces one major change, with many small plumbing changes attached. The ultimate goal is to populate the cache per-file when running a process, for maximum cacheability, while still running processes on batches of files for performance.

The result is that ./pants fmt :: will have a cache entry per tool per file, so running ./pants fmt path/to/file.py immediately afterwards is fully cached.

We accomplish this by:

  • Adding a new option experimental_coalesced_process_batching, defaulting to False, so that users can opt in. This is important because we're blurring the line between what actually ran and what gets cached, which many users might not want, especially in CI.
  • Plugins can opt into support by constructing a MaybeCoalescedProcessBatch (which is just a CoalescedProcessBatch). This type is made up of a set of "common" process args shared between all to-be-coalesced processes (like env) and a mapping from filename to a new SandboxInfo type. The new type holds the input/output info, and should be instantiated for a file using the values that would be used if the user had specified only that single file for the goal (see the sketch after this list).
    • The "maybe" here means we pass the batch through to the engine if the experimental option is True.
    • Otherwise, we merge all the little per-file infos into one process and run it as normal.
    • My hope here is that the overhead introduced by plugin authors writing unconditional code to create the little sandbox digests is negligible. In exchange, we gain a simple interface for conditionally opting into the new behavior.
  • The engine takes the new CoalescedProcessBatch and:
    • At the cache level, constructs a synthetic Process object per file, used just for cache-key calculation. Uncached process runs get collected and passed through to be run as one coalesced process batch.
    • At the actual run-on-metal layer, merges the argv and InputDigests to make one coalesced process.
    • If the process succeeds, cache using the synthetic process keys. For now, throw away stdout and stderr (the process was successful, so this is likely a stomachable tradeoff). The resulting process digest will be filtered to each file's relevant output_files and output_directories.
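
To make the shapes concrete, here is a minimal sketch of what the plugin-facing types described above might look like. All field names here are illustrative guesses based on this description, not the exact definitions in the diff:

```python
from __future__ import annotations

from dataclasses import dataclass

from pants.engine.fs import Digest
from pants.util.frozendict import FrozenDict


@dataclass(frozen=True)
class SandboxInfo:
    """Per-file input/output info, instantiated with the values that would
    be used if the user had specified only this one file for the goal."""

    input_digest: Digest
    output_files: tuple[str, ...]
    output_directories: tuple[str, ...]


@dataclass(frozen=True)
class CoalescedProcessBatch:
    """The "common" process args shared between all to-be-coalesced
    processes, plus a mapping from filename to that file's SandboxInfo."""

    argv: tuple[str, ...]
    env: FrozenDict[str, str]
    description: str
    sandboxes: FrozenDict[str, SandboxInfo]  # filename -> SandboxInfo


# Per the description above, a MaybeCoalescedProcessBatch "is just a
# CoalescedProcessBatch"; the "maybe" is resolved when the batch is run.
MaybeCoalescedProcessBatch = CoalescedProcessBatch
```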

TODOs:

  • Cache storing today just uses the big-process result. This is not ideal.
  • For Python, we need to provide a mechanism to turn a CoalescedVenvPexProcessBatch into the associated type
  • Testing 😄
  • Need to measure this claim: "My hope here is the overhead introduced by plugin authors writing unconditional code to create little sandbox digests is negligible."
  • There's a lot of duplication in the engine between vanilla processes and the new types, especially around "lift"ing the new types from Python to Rust.
  • Should we ensure there's no overlap in SandboxInfo's output_files? I think so (see the fail-fast sketch after this list).
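
For that last TODO, the fail-fast check could be as simple as the following (a hypothetical helper, reusing the illustrative types sketched earlier):

```python
from collections import Counter


def check_no_output_overlap(batch: CoalescedProcessBatch) -> None:
    """Hypothetical validation: two batch members claiming the same output
    path would make filtering the result digest per file ambiguous."""
    counts = Counter(
        path
        for info in batch.sandboxes.values()
        for path in (*info.output_files, *info.output_directories)
    )
    duplicates = sorted(path for path, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Overlapping outputs across batch members: {duplicates}")
```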

@thejcannon thejcannon marked this pull request as draft May 25, 2022 15:25
@thejcannon (Member Author) commented:

I won't label this because it isn't worth wasting CI time 😛

@stuhood (Member) left a comment:

Thanks for sketching this out! I do think that the API is relatively simple.

My next question is just whether enough tools will be able to benefit from this to justify the complexity. It primarily impacts formatters and potentially (batched) dependency inference.

Need to measure this claim: "My hope here is the overhead introduced by plugin authors writing unconditional code to create little sandbox digests is negligible."

It looks like it is when compared to code which is 1) already batching, 2) doesn't use dependencies... which is great. It's likely to be more awkward in cases where a tool uses dependencies (since you must do independent graph walks per root). But that is effectively identical to what recursive @rules (generally: compilers) need to do, so not too weird.

Should we ensure there's no overlap in SandboxInfo's output_files? I think so

Yea, probably: overlap would always be a bug.

Comment on lines +344 to +345:

```python
    if use_coalesced_process_batch:
        return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
```

Member:
From a safety perspective, it might be the case that when batching is disabled, we should instead run file-at-a-time (as that's the easiest way to flush out bugs in having fully declared the dependencies of a particular sub-process). For example: if a file necessary for process A is only included by subprocess B, then it will be missing from A's cache key, but we will not detect that, because when they do not hit the cache, they will run together here.

Although I suppose that you might still see that bug after a partial-cache hit where you hit the cache for B but not A, and then fail when you try to run A independently...?

Member:

Actually... hm. Either batched or unbatched could expose bugs, since both the positive and negative case of a file existing might matter.

@thejcannon (Member Author):

The hope here is that with the config disabled, we exhibit the batching behavior we have today. I think per-file would be a worse experience (which is why we removed it in favor of batching).

That being said, I don't have a good way to find bugs other than some clever testing 🤔

@thejcannon (Member Author):

My next question is just whether enough tools will be able to benefit from this to justify the complexity. It primarily impacts formatters and potentially (batched) dependency inference.

I think formatters, linters, and batched dep inference are all excellent candidates, as the linchpin here is "can we get away with no stdout/stderr on success".

Comment on lines +339 to +343:

```python
@rule
async def run_maybe_coalesced_process_batch(
    request: MaybeCoalescedProcessBatch,
    use_coalesced_process_batch: ExperimentalCoalescedProcessBatchingOption,
) -> FallibleProcessResult:
```

@thejcannon (Member Author):

I'm actually thinking this might be feasible as a @rule_helper, and if we strike it just right, it can "do the right thing": e.g. do the normal mega-await in the normal case, or the mini-awaits in the split case. Something like the sketch below.
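
For illustration, that helper might look something like this (a sketch only; the names and the merge step are hypothetical, not code from this diff):

```python
from pants.engine.process import FallibleProcessResult, Process
from pants.engine.rules import Get, rule_helper


@rule_helper
async def _run_maybe_coalesced(
    request: MaybeCoalescedProcessBatch,
    use_coalesced_process_batch: bool,
) -> FallibleProcessResult:
    if use_coalesced_process_batch:
        # Split case: the engine computes per-file cache keys and runs the
        # uncached remainder as one coalesced batch (the "mini-awaits").
        return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
    # Normal case: merge the per-file infos into one vanilla Process and do
    # the usual "mega-await", exactly as today.
    merged: Process = _merge_into_single_process(request)  # hypothetical helper
    return await Get(FallibleProcessResult, Process, merged)
```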

@benjyw (Contributor) commented Jul 6, 2022

Hey, sorry for the very tardy comments on this. I definitely think this is an important direction to go in. Users may want to make different performance vs. theoretical-correctness tradeoffs, and this is the kind of thing that will support that.

Regarding the implementation though, I'm not sure yet if I think a synthetic Process object is the way to go. It seems potentially troublesome to cache a Process as if it ran when it didn't.

The direction I was imagining we could go here is a little different: differentiate "processes" from "facts", where what we cache is "facts".

One obvious example of a fact is "running this Process produces this output". Those are the only facts we cache today. But there can be other facts. For example, "foobar.py passes this linter with this config". We might be able to establish (and cache) that fact by running a Process on a batch of files that includes foobar.py.

This may seem like hair-splitting on naming, since the synthetic Process is basically the fact abstraction. But it's not entirely: one example of a proper semantic difference is that a fact can selectively ignore Process fields that don't matter to the establishment of the fact (such as the value of $PATH).

Now I will grant that, done without sufficient care, this opens the door to underspecified cache keys, since it would be easy to omit from the fact key a Process field that does turn out to matter. So maybe a fact is tightly bound to a Process after all, or at least by default. I could be convinced. In which case this is mostly about naming.

But even if this is just naming, naming matters. We do want to create a clear distinction between "a process that actually ran" and "some information the user cares about that we learned from that run".
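
A minimal sketch of that distinction (entirely hypothetical types; nothing here is in the diff):

```python
from dataclasses import dataclass, field

from pants.engine.fs import Digest


@dataclass(frozen=True)
class Fact:
    """Some information we learned and want to cache, e.g. 'foobar.py
    passes this linter with this config'."""

    # The cache key: derived from a Process, but in principle free to
    # ignore fields that don't affect the fact (e.g. the value of $PATH).
    key: bytes
    output_digest: Digest
    # Annotation only, deliberately excluded from equality (and thus from
    # the key): whether a Process with exactly this description actually
    # ran, or the fact was established by a batched run with other files.
    established_by_actual_run: bool = field(compare=False)
```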

@stuhood (Member) commented Jul 6, 2022

This may seem like hair-splitting on naming, since the synthetic Process is basically the fact abstraction. But it's not entirely: one example of a proper semantic difference is that a fact can selectively ignore Process fields that don't matter to the establishment of the fact (such as the value of $PATH).

Given that @Eric-Arellano will be starting on #12203 soon, which will definitely be adjusting the Process abstraction to allow more flexibility in this area, sticking with the Process abstraction so that multiple codepaths can take advantage of it makes the most sense to me.

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

@benjyw (Contributor) commented Jul 6, 2022

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

That's fair, but OTOH caching lies ("this process ran with these outputs") also seems like a recipe for hard-to-debug errors.

At the very least, anything we cache should be annotated with whether it was an actual Process that ran or a synthetic one that was generated from the results of some other processes (I say "annotated" because obviously that field can't be part of the cache key). And at that point, what I'm suggesting by way of that annotation is that we name the thing we cache appropriately: even if it wraps a full Process object for safety, it isn't really the same as one, semantically.

@stuhood (Member) commented Jul 6, 2022

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.

That's fair, but OTOH caching lies ("this process ran with these outputs") also seems like a recipe for hard-to-debug errors.

At the very least, anything we cache should be annotated with whether it was an actual Process that ran or a synthetic one that was generated from the results of some other processes (I say "annotated" because obviously that field can't be part of the cache key). And at that point, what I'm suggesting by way of that annotation is that we name the thing we cache appropriately: even if it wraps a full Process object for safety, it isn't really the same as one, semantically.

Perhaps. But that "do not let a coalesced Process collide with a normal Process" can be accomplished by adding an environment variable or some other basic marker to a coalesced Process, rather than by using a separate cache API.
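
Concretely, that marker could be as simple as the following (a hypothetical sketch; the env var name and helper are made up, not code from this PR):

```python
from pants.engine.process import Process


def as_synthetic_cache_entry(per_file: Process) -> Process:
    """Hypothetical: tag a synthetic per-file Process with a marker env var
    so its cache key can never collide with an identical Process that a
    user constructed and ran directly."""
    return Process(
        argv=per_file.argv,
        description=f"(synthetic batch member) {per_file.description}",
        env={**per_file.env, "__PANTS_COALESCED_BATCH_MEMBER__": "1"},
        input_digest=per_file.input_digest,
        output_files=per_file.output_files,
        output_directories=per_file.output_directories,
    )
```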
