Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include/exclude and heuristic-based immutable_inputs #17282

Open
stuhood opened this issue Oct 19, 2022 · 2 comments · Fixed by #17520
Open

Include/exclude and heuristic-based immutable_inputs #17282

stuhood opened this issue Oct 19, 2022 · 2 comments · Fixed by #17520

Comments

@stuhood
Copy link
Member

stuhood commented Oct 19, 2022

Process.immutable_inputs is currently configured as a mapping from a file path to a Digest to materialize at that path. It is merged into the input_digest of a Process to form a "complete" Digest of inputs.

But this separation is a bit unfortunate, because it means that using immutable_inputs (which should be fairly widely used: see discussion on #14070) means a lot of change to how your Process is constructed.


From an API-changes perspective, we should:

  1. remove the guarantee that inputs are mutable by default, to allow us to automatically choose the strategy that we use for each path, and symlink wherever we choose to.
    • Then, by default, we should automatically use symlinking for files or directories over a certain size, etc.
  2. convert the existing immutable_inputs list into a flat list of paths which should be forced/includelisted to be materialized as immutable (regardless of the heuristic from above), but which must already exist in the input_digest (maybe "force" should go in the name? force_immutable_paths= ...?).
    • This will mean that unlike today (where you need to change how your Process is constructed to stop adding an input to the input_digests and instead add it to the immutable_inputs as a dict), you would only need to add a path to the immutable_inputs, without changing your input_digest.
  3. add an mutable_inputs (or force_mutable_paths=...?) flat list of paths to act as an excludelist that prevents a path from being materialized as a symlink.

From an implementation perspective, this would look like moving the support for creating the symlinks for ImmutableInputs from an explicit step before running a process:

// Collect the symlinks to create for immutable inputs or named caches.
let immutable_inputs_symlinks = {
let symlinks = immutable_inputs
.local_paths(&req.input_digests.immutable_inputs)
.await?;
match immutable_inputs_prefix {
Some(prefix) => symlinks
.into_iter()
.map(|symlink| WorkdirSymlink {
src: symlink.src,
dst: prefix.join(
symlink
.dst
.strip_prefix(immutable_inputs.workdir())
.unwrap(),
),
})
.collect::<Vec<_>>(),
None => symlinks,
}
};

...to a thing that happens inside fn materialize_directory when an includelisted path is encountered:
///
/// Lays out the directory and all of its contents (files and directories) on disk so that a
/// process which uses the directory structure can run.
///
/// Although `Directory` has internally unique paths, `materialize_directory` can be used with
/// an existing destination directory, meaning that directory and file creation must be
/// idempotent.
///
pub async fn materialize_directory(
&self,
destination: PathBuf,
digest: DirectoryDigest,
perms: Permissions,
) -> Result<(), StoreError> {

materialize_directory(_helper) would take either:

  1. a reference to a collection of paths which were includelisted as immutable
  2. a trait or function which implemented the heuristic for which directory::Entrys / paths to includelist

...and a reference to ImmutableInputs, and then would create a symlink to the ImmutableInputs whenever it encountered one of the includelisted paths.

@thejcannon
Copy link
Member

thejcannon commented Oct 24, 2022

In the most common case (fixers) couldn't we deduce the mutable paths by the intersection of the input and output paths?

The output paths for any process will either a) not exist yet or b) do exist (from the inputs) and therefore need to be mutable (or else why are you capturing them)

@stuhood
Copy link
Member Author

stuhood commented Nov 7, 2022

In the most common case (fixers) couldn't we deduce the mutable paths by the intersection of the input and output paths?

The output paths for any process will either a) not exist yet or b) do exist (from the inputs) and therefore need to be mutable (or else why are you capturing them)

Yea, for sure.

@thejcannon thejcannon self-assigned this Nov 8, 2022
thejcannon added a commit that referenced this issue Dec 6, 2022
Fixes #17282 and fixes #14070

This change represents the smallest footprint change to get support in for treating "large" files as immutable inputs.

- `immutable_inputs.rs` has been moved to `store` (to avoid circular reference)
- An additional method was added to support a hardlink _file_
- Directory materialization takes an `ImmutableInputs` ref and a list of paths to ensure mutable
- When materializing a file, if its above our threshold and not being forced mutable, we hardlink it to the immutable inputs
- Process running seeds the mutable paths with the capture outputs

The future is primed for changes like:
- Eventually removing the `immutable_input_digests` to a process, and letting the heuristic take over
- And then cleaning the code up after that's ripped out
- Adding more facilities to includelist/excludelist files from a `Process` object (e.g. we could includelist most/all PEXs since those shouldn't be mutated and we'd just have one top-level hardlink)
- Have a directory huerstic
- IDK more shenanigans 😄 

Tested 3 ways:
- `./pants --keep-sandboxes=always <something>` and inspected the sandbox between 2 different runs using the same daemon and ensured the hardlink
- Crafted an `experimental_shell_command` with a file in `outputs` that matches a large file and ensured the file in the sandbox wasn't hardlinked
- Crafted an `experimental_shell_command` with a dir in `outputs` that matches the containing dir of a large file and ensured the file in the sandbox wasn't hardlinked
@stuhood stuhood reopened this May 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants