Use immutable_inputs for PEXs #14070
Comments
This is currently blocked on #13899.
#13899 has been stably fixed for a while, so we should be good to do this now.
Is this too risky to backport to 2.13? It sounds like very low hanging performance fruit.
Depends how large the patch was probably. But 2.13 is already pretty large (~7 weeks).
@stuhood I'm interested in working on this one 👀 The only reference to |
Without fully answering your question (@Eric-Arellano would probably be better equipped for that), I'll say that: |
Oh nice - does that mean this is no longer relevant since non-PEX lockfiles are being deprecated?
(Just thinking "out loud") I see that |
What are the downsides of using |
Correct.
The overhead of something being used as an immutable input is very low, but not zero: so if something is always used exactly once as an input, it won't be worth it to use it that way. But yes: I expect that almost all PEXes will make sense to provide as One other consideration before starting this would just be whether the API of |
Maybe starting with tool PEXes would make sense? For almost every workflow, we materialize the same PEX multiple times: I suspect we also materialize the same Those tool PEXes all get built using this helper: pants/src/python/pants/backend/python/subsystems/python_tool_base.py Lines 212 to 228 in 8b03d13
While I agree with the change, I don't think we should block on it. This is mostly an internal API, and it's not very costly for us to change now vs in a month or two. Sounds like neither you nor I have time this month to do this, so it'd be better to not block @danxmoran from making an awesome performance improvement.
So, |
Because it's one of those things that belong in the giant (useful) hack list. Just like using Pex for its packed layout instead of using Pip and venvs directly. We really need to keep clear eyes about where our hacks are and why and when we need them. Packed layout reduces most tool venvs to O(10) files (zips) I think, so the immutable inputs hack is only likely useful for user packed PEXes with O(100) files. In other words, our hacks here are:
Thanks for explaining that, John. IIUC, it's not only the number of files that is an issue, but also the size of those files? Symlinking a large file is much faster than copying it? So, even though tool PEXes are roughly O(10), we may still get a speedup?
I would consider blocking on it, just because I'm sure that @danxmoran is capable of doing it, and doing it would drastically reduce the API surface area of the change. See #13862 for an example of converting the JVM to use With the altered API, using |
Yes, but tools are small in practice in addition to having small dependency sets. So again we're only likely to hit this in user packed PEXes.
I've opened #17282 to cover the API change that I think would be worth making before working on this. It's also possible that depending on the heuristic that is used there (i.e. if it is size based), then there wouldn't actually be any PEX/Python-specific work to do here.
Fixes #17282 and fixes #14070

This change represents the smallest footprint change to get support in for treating "large" files as immutable inputs.

- `immutable_inputs.rs` has been moved to `store` (to avoid a circular reference)
- An additional method was added to support hardlinking a _file_
- Directory materialization takes an `ImmutableInputs` ref and a list of paths to ensure mutable
- When materializing a file, if it's above our threshold and not being forced mutable, we hardlink it to the immutable inputs
- Process running seeds the mutable paths with the capture outputs

The future is primed for changes like:

- Eventually removing the `immutable_input_digests` to a process, and letting the heuristic take over
- And then cleaning the code up after that's ripped out
- Adding more facilities to includelist/excludelist files from a `Process` object (e.g. we could includelist most/all PEXs since those shouldn't be mutated and we'd just have one top-level hardlink)
- Have a directory heuristic
- IDK more shenanigans 😄

Tested 3 ways:

- `./pants --keep-sandboxes=always <something>` and inspected the sandbox between 2 different runs using the same daemon and ensured the hardlink
- Crafted an `experimental_shell_command` with a file in `outputs` that matches a large file and ensured the file in the sandbox wasn't hardlinked
- Crafted an `experimental_shell_command` with a dir in `outputs` that matches the containing dir of a large file and ensured the file in the sandbox wasn't hardlinked
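The size-based heuristic described in this PR can be illustrated with a rough Python sketch. This is not the actual change (which lives in the Rust engine); the threshold value, the function names, and the way mutable paths are matched are all assumptions made for illustration only.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path

# Hypothetical threshold: the PR text does not state the real value.
LARGE_FILE_THRESHOLD_BYTES = 1024 * 1024


def is_forced_mutable(rel_path: str, mutable_paths: set[str]) -> bool:
    """True if rel_path is, or lives under, a path the process may write to."""
    parts = Path(rel_path).parts
    prefixes = ("/".join(parts[:i]) for i in range(1, len(parts) + 1))
    return any(prefix in mutable_paths for prefix in prefixes)


def materialize_file(
    store_file: Path,
    sandbox: Path,
    rel_path: str,
    mutable_paths: set[str],
    threshold: int = LARGE_FILE_THRESHOLD_BYTES,
) -> str:
    """Place store_file at sandbox/rel_path and report how it was placed.

    Large files that are not forced mutable are hardlinked to the shared
    copy; everything else is copied so the process can safely modify it.
    """
    dest = sandbox / rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    is_large = store_file.stat().st_size > threshold
    if is_large and not is_forced_mutable(rel_path, mutable_paths):
        os.link(store_file, dest)  # shares the inode, no data is copied
        return "hardlink"
    shutil.copyfile(store_file, dest)
    return "copy"
```

In this sketch, `mutable_paths` plays the role of the declared outputs that seed the mutable paths: a large file that matches an output file, or sits under an output directory, is copied rather than hardlinked, which mirrors the second and third test scenarios above.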
#13848 (designed in #12716) added support for immutable process inputs, which are symlinked into the sandbox, rather than copied there. Once immutable inputs have stabilized (notably, once #13899 is fixed), we should provide the `repository.pex` as an immutable input to significantly reduce IO.

It's also possible that in most cases where a PEX is used as an input to a process, supplying it as an immutable input would make sense: in particular, cases where a PEX contains only thirdparty code, and firstparty code is materialized as loose files.
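To make the IO saving concrete, here is a minimal Python sketch of the general idea: materialize an input once into a shared location keyed by its digest, then symlink it into each sandbox instead of copying it. The function names, the store layout, and the `abc123` digest are hypothetical and do not reflect the Pants engine's real API.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path


def materialize_immutable(store_root: Path, digest: str, source: Path) -> Path:
    """Copy source into a shared store keyed by digest, at most once."""
    target = store_root / digest
    if not target.exists():
        store_root.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        target.chmod(0o444)  # read-only, since many sandboxes share it
    return target


def link_into_sandbox(immutable: Path, sandbox: Path, rel_path: str) -> Path:
    """Expose the shared copy inside a sandbox via a symlink (no data copied)."""
    dest = sandbox / rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(immutable, dest)
    return dest
```

With this shape, a second process that needs the same PEX pays only for creating a symlink, regardless of how large the file is.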