Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use immutable_inputs for PEXs #14070

Closed
stuhood opened this issue Jan 4, 2022 · 16 comments · Fixed by #17520
Closed

Use immutable_inputs for PEXs #14070

stuhood opened this issue Jan 4, 2022 · 16 comments · Fixed by #17520
Labels

Comments

@stuhood
Copy link
Member

stuhood commented Jan 4, 2022

#13848 (designed in #12716) added support for immutable process inputs, which are symlinked into the sandbox, rather than copied there. Once immutable inputs have stabilized (notably, once #13899 is fixed), we should provide the repository.pex as an immutable input to significantly reduce IO.

It's also possible that in most cases where a PEX is used as an input to a process, that supplying it as an immutable input would make sense: in particular, cases where a PEX contains only thirdparty code, and firstparty code is materialized as loose files.

@stuhood stuhood changed the title Use immutable_inputs for repository.pex Use immutable_inputs for PEXs Jan 5, 2022
@stuhood
Copy link
Member Author

stuhood commented May 19, 2022

This is currently blocked on #13899.

@thejcannon thejcannon added the backend: Python Python backend-related issues label Jun 7, 2022
@stuhood
Copy link
Member Author

stuhood commented Jun 22, 2022

#13899 has been stably fixed for a while, so we should be good to do this now.

@Eric-Arellano
Copy link
Contributor

Is this too risky to backport to 2.13? It sounds like very low hanging performance fruit

@stuhood
Copy link
Member Author

stuhood commented Jun 23, 2022

Is this too risky to backport to 2.13? It sounds like very low hanging performance fruit

Depends how large the patch was probably. But 2.13 is already pretty large (~7 weeks).

@danxmoran
Copy link
Contributor

@stuhood I'm interested in working on this one 👀

The only reference to repository.pex I see is here. Walking backwards from there, the code jumps through various rules & helpers before landing on generalize_requirements_pex_request, which I see is invoked (via Get(RequirementsPexRequest, ...)) in a handful of modules under src/python/pants/backend/python/. Is the idea to update how the outputs of those invocations are used, so they're linked in as immutable inputs?

@stuhood
Copy link
Member Author

stuhood commented Aug 25, 2022

Without fully answering your question (@Eric-Arellano would probably be better equipped for that), I'll say that: repository.pex is no longer used with PEX lockfiles (only with non-PEX/native lockfiles), because #14127 moved to using PEX's lockfile subsetting support directly.

@danxmoran
Copy link
Contributor

Oh nice - does that mean this is no longer relevant since non-PEX lockfiles are being deprecated?

@danxmoran
Copy link
Contributor

(Just thinking "out loud") I see that VenvPexProcess doesn't have an immutable_input_digests field to match the plain Process - maybe this ticket evolves to not focus on repository.pex, but instead add that plumbing & ensure we're using it appropriately everywhere we can?

@Eric-Arellano
Copy link
Contributor

What are the downsides of using immutable_inputs, beyond code complexity? I'm wondering if we should use this ~every where? Like pytest_runner.pex for example.

@stuhood
Copy link
Member Author

stuhood commented Aug 25, 2022

maybe this ticket evolves to not focus on repository.pex, but instead add that plumbing & ensure we're using it appropriately everywhere we can?

Correct.

What are the downsides of using immutable_inputs, beyond code complexity? I'm wondering if we should use this ~every where? Like pytest_runner.pex for example.

The overhead of something being used as an immutable input is very low, but not zero: so if something is always used exactly once as an input, it won't be worth it to use it that way.

But yes: I expect that almost all PEXes will make sense to provide as immutable_inputs.


One other consideration before starting this would just be whether the API of immutable_inputs should be adjusted at all. We had briefly discussed adjusting it from a dict[path, Digest] to a set[path]... i.e., paths in the sandbox are marked as being materialized immutably, and are assumed to already exist in the input Digest for a process. The materialize_digest call in local process execution would then take the set[path], and when it encountered one of those paths, would materialize it as immutable rather than mutable (replacing this code).

@Eric-Arellano
Copy link
Contributor

But yes: I expect that almost all PEXes will make sense to provide as immutable_inputs.

Maybe starting with tool PEXes would make sense? For almost every workflow, we materialize the same PEX multiple times: pytest.pex, black.pex, and so on. That's because Pants often partitions things, like Pytest running one python_test at at time.

I suspect we also materialize the same requirements.pex multiple times, but it's less of a guarantee because there are theoretically as many requirements.pex as unique combinations of your third-party requirements.

Those tool PEXes all get built using this helper:

def to_pex_request(
self,
*,
interpreter_constraints: InterpreterConstraints | None = None,
extra_requirements: Iterable[str] = (),
main: MainSpecification | None = None,
sources: Digest | None = None,
) -> PexRequest:
return PexRequest(
output_filename=f"{self.options_scope.replace('-', '_')}.pex",
internal_only=True,
requirements=self.pex_requirements(extra_requirements=extra_requirements),
interpreter_constraints=interpreter_constraints or self.interpreter_constraints,
main=main,
sources=sources,
)

One other consideration before starting this would just be whether the API of immutable_inputs should be adjusted at all.

While I agree with the change, I don't think we should block on it. This is an internal API mostly, and it's not very costly for us to change now vs in a month or two. Sounds like neither you or me have time this month to do this, so it'd be better to not block @danxmoran making an awesome performance improvement.

does that mean this is no longer relevant since non-PEX lockfiles are being deprecated?

repository.pex will still exist because we will continue to support manually-managed lockfiles using requirements.txt style. There, we will use repository.pex. We only are deprecating Pants generating requirements.txt-style locks via Poetry with the generate-lockfiles goal, given how many issues there are like [python-repos] not being hooked up.

So, repository.pex will still exist, just most users won't trigger that codepath, so it's lower priority.

@jsirois
Copy link
Contributor

jsirois commented Aug 25, 2022

What are the downsides of using immutable_inputs

Because it's one of those things that belong in the giant (useful) hack list. Just like using Pex for its packed layout instead of using Pip and venvs directly. We really need to keep clear eyes about where our hacks are and why and when we need them. Packed layout reduces most tool venvs to O(10) files (zips) I think, so the immutable inputs hack is only likely useful for user packed PEXes with O(100) files.

In other words, our hacks here are:

  1. Use packed PEXes to reduce O(10k) files to O(10-100) files because materializing sandboxes is slow.
  2. Use immutable inputs to further hack that and bring down the materialization of the O(10-100) files to a 1 time event per pantsd restart.

@Eric-Arellano
Copy link
Contributor

Thanks for explaining that, John. Iiuc, it's not only the # of files that is an issue, but the size of those files? Symlinking a large file is much faster than copying it? So, even though tool PEXes are roughly O(10), we may still get a speed up?

@stuhood
Copy link
Member Author

stuhood commented Aug 25, 2022

One other consideration before starting this would just be whether the API of immutable_inputs should be adjusted at all.

While I agree with the change, I don't think we should block on it. This is an internal API mostly, and it's not very costly for us to change now vs in a month or two. Sounds like neither you or me have time this month to do this, so it'd be better to not block @danxmoran making an awesome performance improvement.

I would consider blocking on it, just because I'm sure that @danxmoran is capable of doing it, and doing it would drastically reduce the API surface area of the change. See #13862 for an example of converting the JVM to use immutable_inputs with the current API.

With the altered API, using immutable_inputs would instead amount to includelisting a few paths on the Process constructor.

@jsirois
Copy link
Contributor

jsirois commented Aug 25, 2022

So, even though tool PEXes are roughly O(10), we may still get a speed up?

Yes, but tools are small in practice in addition to having small dependency sets. So again we're only likely to hit this in user packed PEXes.

@stuhood
Copy link
Member Author

stuhood commented Oct 19, 2022

I've opened #17282 to cover the API change that I think would be worth making before working on this. It's also possible that depending on the heuristic that is used there (i.e. if it is size based), then there wouldn't actually be any PEX/Python-specific work to do here.

thejcannon added a commit that referenced this issue Dec 6, 2022
Fixes #17282 and fixes #14070

This change represents the smallest footprint change to get support in for treating "large" files as immutable inputs.

- `immutable_inputs.rs` has been moved to `store` (to avoid circular reference)
- An additional method was added to support a hardlink _file_
- Directory materialization takes an `ImmutableInputs` ref and a list of paths to ensure mutable
- When materializing a file, if its above our threshold and not being forced mutable, we hardlink it to the immutable inputs
- Process running seeds the mutable paths with the capture outputs

The future is primed for changes like:
- Eventually removing the `immutable_input_digests` to a process, and letting the heuristic take over
- And then cleaning the code up after that's ripped out
- Adding more facilities to includelist/excludelist files from a `Process` object (e.g. we could includelist most/all PEXs since those shouldn't be mutated and we'd just have one top-level hardlink)
- Have a directory huerstic
- IDK more shenanigans 😄 

Tested 3 ways:
- `./pants --keep-sandboxes=always <something>` and inspected the sandbox between 2 different runs using the same daemon and ensured the hardlink
- Crafted an `experimental_shell_command` with a file in `outputs` that matches a large file and ensured the file in the sandbox wasn't hardlinked
- Crafted an `experimental_shell_command` with a dir in `outputs` that matches the containing dir of a large file and ensured the file in the sandbox wasn't hardlinked
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging a pull request may close this issue.

5 participants