WIP/RFC: "Coalesced" process batching #15648
Conversation
I won't label this because it isn't worth wasting CI time 😛
Thanks for sketching this out! I do think that the API is relatively simple.

My next question is just whether enough tools will be able to gain benefit from this to justify the complexity. It primarily impacts `fmt`ers and potentially (batched) dependency inference.
> Need to measure this claim: "My hope here is the overhead introduced by plugin authors writing unconditional code to create little sandbox digests is negligible."

It looks like it is when compared to code which is 1) already batching, 2) doesn't use dependencies... which is great. It's likely to be more awkward in cases where a tool uses dependencies (since you must do independent graph walks per root). But that is effectively identical to what recursive `@rule`s (generally: compilers) need to do, so not too weird.
> Should we ensure there's no overlap in `SandboxInfo`'s `output_files`? I think so

Yea, probably: overlap would always be a bug.
```python
if use_coalesced_process_batch:
    return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
```
From a safety perspective, it might be the case that when batching is disabled, we should instead run file-at-a-time (as that's the easiest way to flush out bugs in having fully declared the dependencies of a particular sub-process). For example: if a file necessary for process `A` is only included by subprocess `B`, then it will be missing from `A`'s cache key, but we will not detect that, because when they do not hit the cache, they will run together here.

Although I suppose that you might still see that bug after a partial-cache hit where you hit the cache for `B` but not `A`, and then fail when you try to run `A` independently...?
Actually... hm. Either batched or unbatched could expose bugs, since both the positive and negative case of a file existing might matter.
The hope here is that with the config disabled, we exhibit the batching behavior we have today. I think per-file would be a worse experience (which is why we removed it in favor of batching).

That being said, I don't have a good way to find bugs other than some clever testing 🤔

I think formatters, linters, and batched dep inference are all excellent candidates, as the linchpin here is "can we get away with no stdout/stderr on success".
```python
@rule
async def run_maybe_coalesced_process_batch(
    request: MaybeCoalescedProcessBatch,
    use_coalesced_process_batch: ExperimentalCoalescedProcessBatchingOption,
) -> FallibleProcessResult:
```
I'm actually thinking this might be feasible as a `@rule_helper`, and if we strike it just right, it can "do the right thing". E.g. do the normal mega-`await` in the normal case, or the mini-`await`s in the split case.
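Something like the following, perhaps — a rough sketch only, assuming the `@rule_helper` decorator and the types from this PR (`MaybeCoalescedProcessBatch`, `CoalescedProcessBatch`); the `to_single_process()` conversion is hypothetical and stands in for whatever conversion rule authors already do:

```python
from pants.engine.process import FallibleProcessResult, Process
from pants.engine.rules import Get, rule_helper


@rule_helper
async def _execute_maybe_coalesced(
    request: MaybeCoalescedProcessBatch,  # type introduced in this PR
    use_coalesced_process_batch: bool,
) -> FallibleProcessResult:
    if use_coalesced_process_batch:
        # Split case: hand the whole batch to the engine, which can populate the
        # cache per file while still running the tool on the full batch.
        return await Get(FallibleProcessResult, CoalescedProcessBatch, request)
    # Normal case: the single mega-await over one ordinary batched Process, as today.
    # `to_single_process()` is hypothetical, not part of this PR.
    return await Get(FallibleProcessResult, Process, request.to_single_process())
```

Either branch hands back a `FallibleProcessResult`, so callers wouldn't need to know which path actually ran.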
Hey, sorry for the very tardy comments on this. I definitely think this is an important direction to be going down. Users may want to make different performance vs. theoretical correctness tradeoffs, and this is the kind of thing that will support that.

Regarding the implementation though, I'm not sure yet if I think a synthetic Process object is the way to go. It seems potentially troublesome to cache a Process as if it ran when it didn't. The direction I was imagining we could go here is a little different, and it is to differentiate "processes" from "facts", where what we cache is "facts". One obvious example of a fact is "running this Process produces this output". Those are the only facts we cache today. But there can be other facts. For example, "foobar.py passes this linter with this config". We might be able to establish (and cache) that fact by running a Process on a batch of files that includes foobar.py.

This may seem like hair-splitting on naming, since the synthetic Process is basically the fact abstraction. But it's not entirely. Now I will grant that, done without sufficient care, this opens the door for underspecified cache keys, since it would be easy to omit a Process field that does turn out to matter from the fact key. So maybe a fact is tightly bound to a Process after all, or at least by default. I could be convinced. In which case this is mostly about naming. But even if this is just naming, naming matters. We do want to create a clear distinction between "a process that actually ran" and "some information the user cares about that we learned from that run".
Given that @Eric-Arellano will be starting on #12203 soon, which will definitely be adjusting the […]

Using include-list based "facts" rather than exclude-list was very error prone in v1. IMO, we should carefully poke holes in the safety wall, rather than starting with the minimum wall we can identify and adding bricks as we discover bugs.
That's fair, but OTOH caching lies ("this process ran with these outputs") also seems like a recipe for hard-to-debug errors. At the very least, anything we cache should be annotated with whether it was an actual Process that ran or a synthetic one that was generated from the results of some other processes (I say "annotated" because obviously that field can't be part of the cache key). And at that point, what I'm suggesting by way of that annotation is that we name the thing we cache appropriately: even if it wraps a full Process object for safety, it isn't really the same as one, semantically.
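To make that concrete, here is one hypothetical shape for such an annotated cache entry; nothing like this exists in Pants today, and the names are illustrative only:

```python
from dataclasses import dataclass, field

from pants.engine.process import FallibleProcessResult, Process


@dataclass(frozen=True)
class CachedFact:
    # Tightly bound to a full Process so the cache key stays fully specified.
    process: Process
    result: FallibleProcessResult
    # Annotation only: did this exact Process actually execute, or was the result
    # synthesized from a batched run? Excluded from equality/hashing so it can
    # never influence the cache key.
    actually_ran: bool = field(compare=False, hash=False)
```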
Perhaps. But that "do not let a coalesced Process collide with a normal Process" can be accomplished by adding an environment variable or some other basic marker to a coalesced Process, rather than by using a separate cache API.
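For example (purely illustrative — the env var name, tool, and file path here are not from this PR):

```python
from pants.engine.process import Process

# A synthesized per-file Process carries a basic marker in its env, so its cache key
# can never collide with a Process that a plugin constructs directly for the same file.
synthetic = Process(
    argv=("black", "--check", "src/app.py"),
    description="Synthesized per-file entry for src/app.py",
    env={"__PANTS_SYNTHETIC_COALESCED": "1"},
)
```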
!! Opening draft PR as a WIP and RFC !!
This PR introduces one major change, which brings many little plumbing changes with it. The ultimate goal is to have the cache be populated per-file when running a process for maximum cacheability, but still run processes on batches of files for performance.
The result being that `./pants fmt ::` will have a cache entry per tool per file. Therefore, running `./pants fmt path/to/file.py` immediately after is fully cached.

We accomplish this by:
- Introducing a new option, `experimental_coalesced_process_batching`, defaulting to `False` so that users can opt in. This is important as we're blurring lines which many users might not want blurred, especially in CI.
- Introducing a new `MaybeCoalescedProcessBatch` type (which is just a `CoalescedProcessBatch`). This type is made up of a set of "common" process args shared between all to-be-coalesced processes (like `env`) and a mapping from filename to a new `SandboxInfo` type. The new type holds the input/output info, and should be instantiated for a file using the values that would be used if the user had only specified that single file for the goal (see the sketch after this list).
- When the option is `True`, converting the `MaybeCoalescedProcessBatch` into a `CoalescedProcessBatch` and:
  - merging the per-file `argv`s and `InputDigests` to make one coalesced process.
  - combining the per-file `output_files` and `output_directories`.
TODOs:

- Fold `CoalescedVenvPexProcessBatch` into the associated type
- Should we ensure there's no overlap in `SandboxInfo`'s `output_files`? I think so

[ci skip-build-wheels]