
Enable running multiple processes in the same sandbox. #17184

Open

tgolsson opened this issue Oct 10, 2022 · 11 comments

Comments

@tgolsson
Contributor

tgolsson commented Oct 10, 2022

Is your feature request related to a problem? Please describe.

To run an OCI image locally with runc, one needs to unpack the rootfs onto the local file system and then execute runc inside that filesystem. Since the unpacked filesystem contains symlinks and so on, Pants barfs when trying to capture it as the output of a process. Capturing it is also wasteful: the filesystem is quite large and only used during the run, so it makes more sense to fuse unpack + run into a single sandbox with no outputs.

However, there's no built-in abstraction for this in Pants right now.

Describe the solution you'd like

I'd like a built-in mechanism for running multiple processes with a fused input/output. I implemented a prototype to solve my immediate problem here:

https://github.com/tgolsson/pants-backends/blob/main/pants-plugins/oci/pants_backend_oci/tools/process.py
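
Roughly, the idea is that a rule can request several Processes as a single unit that shares one sandbox. A minimal sketch of the intended usage, based on the prototype above (the argvs and digests here are placeholders, and the exact FusedProcess fields may differ):

    # Sketch only: unpack the image and run it inside the same sandbox,
    # capturing no outputs. `unpack_argv`, `runc_argv`, and `bundle_digest`
    # are placeholders for this example.
    result = await Get(
        FallibleProcessResult,
        FusedProcess(
            (
                Process(
                    unpack_argv,
                    input_digest=bundle_digest,
                    description="Unpack OCI image into a runc bundle",
                ),
                Process(
                    runc_argv,
                    input_digest=bundle_digest,
                    description="Run the container with runc",
                ),
            )
        ),
    )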

Describe alternatives you've considered
I can keep my variant; it just seems like a useful building block for Pants to have "natively". As discussed in the Slack thread below, one could also run this as multiple separate processes when/if absolute symlinks are supported, but that would waste cache space.

Additional context
Discussion where I first hit this issue: https://pantsbuild.slack.com/archives/C01CQHVDMMW/p1665338251230869

@tgolsson
Contributor Author

(And IMO my approach shouldn't be upstreamed as-is, because the UX isn't great: it also has to merge the descriptions. OTOH, perfect is the enemy of good enough, or something like that.)

@thejcannon
Member

FWIW, symlink support is coming down the line (no ETA).

Additionally, for this use case we'd usually advocate just using a shell script or similar as your process.
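
For example (a rough sketch only, with placeholder commands and digests), the two steps could be written into a small script that's materialized with CreateDigest and run as a single Process:

    # Rough sketch: wrap both commands in one script so a single Process runs
    # them in one sandbox. Commands, paths, and the tool/image digests are
    # placeholders; `bash` comes from requesting BashBinary.
    script = FileContent(
        "__unpack_and_run.sh",
        b"set -euo pipefail\n"
        b"./umoci unpack --image image:latest bundle\n"
        b"./runc run --bundle bundle my-container\n",
    )
    script_digest = await Get(Digest, CreateDigest([script]))
    input_digest = await Get(Digest, MergeDigests([script_digest, tools_digest, image_digest]))
    process = Process(
        argv=[bash.path, script.path],
        input_digest=input_digest,
        description="Unpack OCI image and run it with runc",
    )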

@tgolsson
Contributor Author

I'm using a shell script that's auto-generated from a bunch of processes (I noticed the link had died and updated it). I think the downside is that it also requires merging the descriptions etc. And for this very specific application, it'd be wasteful to have separate sandboxes, since that would just duplicate the Docker image as separate files.

@thejcannon
Member

Is this just a performance issue, then? On the horizon, we're playing with symlinks for immutable (large, by some definition) inputs.

Otherwise, your request is tricky because Pants is written with support for remote execution, and AFAICT the remote execution API doesn't support this feature.

@thejcannon
Member

See #17282

@tgolsson
Contributor Author

tgolsson commented Oct 24, 2022

OK, I've had a think; there are three issues here from my perspective.

  • Symlinks do not work fully in sandboxes, requiring a workaround.
  • Even if symlinks worked, it'd be wasteful to copy the contents, since they're ephemeral and cannot be consumed by any other goal.
  • Even discounting the waste, there is a conceptual disconnect here: the first step is really only preparation for the actual work, which is running the container.

My workaround solves points 1 and 2, but the UX isn't optimal for 3 since I have to merge the descriptions, making it harder to understand what is being done and how long each step takes.

Same thing happens in my build step fwiw; I want to:

  • put X number of packages into the container
  • configure entrypoint and cmd
  • maybe set up environment
  • maybe tag as a layer
  • maybe compress/prune/whiteout/...

Each one is an invocation of umoci (roughly as sketched below), and no single step produces a "complete" item.
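
As a rough illustration (the umoci flags here are placeholders rather than verified syntax), each of those steps is its own Process along these lines:

    # Illustration only: one such step, setting the image's entrypoint.
    # Flags, paths, and `image_digest` are placeholders for this sketch.
    configure = Process(
        (umoci.exe, "config", "--image", "build/image:latest",
         "--config.entrypoint", "/app/server"),
        input_digest=image_digest,
        description="Set OCI image entrypoint",
        output_directories=("build",),
    )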

@benjyw
Contributor

benjyw commented Nov 25, 2022

cc @chrisjrn as this may relate to some of the work on experimental_shell_command

@chrisjrn
Contributor

I'm not completely sold on there being a need for this beyond writing a shell script and using experimental_shell_command -- we generally like for each target to represent a complete "build step" that is meaningfully cacheable and usable by multiple dependent targets. It sounds like your use case is for a set of sequential processes, where the intermediate targets are only used by the next target in a sequence. Is that right? Is the goal here to get better logging of each of the intermediate steps?
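
For reference, a minimal sketch of that route in a BUILD file (target and field names are from memory and may differ between Pants versions; the command and `:image_tar` dependency are placeholders):

    # Sketch of the shell-script route via experimental_shell_command.
    # Field names may vary by Pants version; the command is illustrative.
    experimental_shell_command(
        name="unpack_and_run",
        command="./umoci unpack --image image:latest bundle && ./runc run --bundle bundle c",
        tools=["umoci", "runc"],
        dependencies=[":image_tar"],
    )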

@tgolsson
Contributor Author

tgolsson commented Nov 28, 2022

I don't think experimental_shell_command is at all relevant for developing plugins, but maybe I'm wrong. Logging is an important part, but so is removing the boilerplate of taking what might be a few process invocations and building a script from them. If that is the intended way of doing multiple steps in one sandbox, can it be made easier for plugin developers (and more discoverable!) and clearer for end users (logging!)?

@tgolsson
Contributor Author

tgolsson commented Nov 28, 2022

I've had another think and wanted to elaborate a bit further, because I glossed over the question regarding my use case.

I think there are two distinct cases where this can be used: one where it is an anti-pattern, and one where it isn't. I think I'm incorrectly using this to build and configure an image in one go, like here:

        compile_result = await Get(
            FallibleProcessResult,
            FusedProcess(
                (
                    Process(
                        (
                            umoci.exe,
                            "raw",
                            "add-layer",
                            ...
                            "layers/image_bundle.tar",
                        ),
                        input_digest=command_digest,
                        description=f"Package OCI Image Bundle: {request.target.address}",
                    ),
                    Process(
                        (
                            umoci.exe,
                            "config",
                            ...
                        ),
                        input_digest=command_digest,
                        description=f"Configure OCI Image: {request.target.address}",
                        output_directories=("build",),
                    ),
                ),
            ),
        )

Here I'm adding a layer to an image (process 1) and then configuring that "atomically" (process 2). This means that if I change the environment I'll have to rebuild the contents of that layer. Bad! I can split that into two distinct steps and get much faster configure times, at the cost of extra complexity and potentially increased cache use. That's an important consideration, but I think my current choice is suboptimal.
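
A sketch of how that split might look (two independently cached steps, where the second reuses the first's output digest so only the configure step reruns when the config changes; the elided args are as in the snippet above):

    # Sketch: split the fused process into two separately cached steps.
    layer_result = await Get(
        ProcessResult,
        Process(
            (
                umoci.exe,
                "raw",
                "add-layer",
                ...,
                "layers/image_bundle.tar",
            ),
            input_digest=command_digest,
            description=f"Package OCI Image Bundle: {request.target.address}",
            output_directories=("build",),
        ),
    )
    configure_input = await Get(
        Digest, MergeDigests([command_digest, layer_result.output_digest])
    )
    config_result = await Get(
        ProcessResult,
        Process(
            (
                umoci.exe,
                "config",
                ...,
            ),
            input_digest=configure_input,
            description=f"Configure OCI Image: {request.target.address}",
            output_directories=("build",),
        ),
    )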

So, case two:

    return await Get(
        Process,
        FusedProcess(
            (
                packed_image_process,
                mkdir_process,
                Process(
                    command,
                    description=f"Running {request.target}",
                    input_digest=tool.digest,
                ),
            )
        ),
    )

Here I would say I'm using it properly. This is the case where the symlinks break if I cache it, but fundamentally the "unpacked" image is a dead end build-wise. Caching the unpacked image is going to waste space for almost no benefit, and the only thing that can be done with it is to run it. Maybe we could run it with different args and keep it unpacked, but I don't see much benefit to that. (Let's ignore my usage of a mkdir_process, which should clearly be a CreateDigest, I've learned. ;-))
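
(A minimal sketch of that CreateDigest replacement, with a placeholder directory name:)

    # Create an empty directory in the sandbox up front instead of running mkdir.
    # "bundle" is a placeholder for whatever directory the run needs.
    empty_dir_digest = await Get(Digest, CreateDigest([Directory("bundle")]))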

So maybe what I'm thinking about is that there are 1..N steps that sometimes have to happen to "get ready" and "finish up", just like I make coffee before I start working in the morning and put away my cup at the end of the day. The coffee is just an aside to the coding goal; it's only important to me.

@Eric-Arellano
Contributor

One workaround is to create a bash script that invokes both processes, then invoke that bash script as a single Process. For example, we use a bash script to invoke Go processes:

@rule
async def go_sdk_invoke_setup(goroot: GoRoot) -> GoSdkRunSetup:
    # Note: The `go` tool requires GOPATH to be an absolute path which can only be resolved
    # from within the execution sandbox. Thus, this code uses a bash script to be able to resolve
    # absolute paths inside the sandbox.
    go_run_script = FileContent(
        "__run_go.sh",
        textwrap.dedent(
            f"""\
            export GOROOT={goroot.path}
            sandbox_root="$(/bin/pwd)"
            export GOPATH="${{sandbox_root}}/gopath"
            export GOCACHE="${{sandbox_root}}/cache"
            /bin/mkdir -p "$GOPATH" "$GOCACHE"
            if [ -n "${GoSdkRunSetup.CHDIR_ENV}" ]; then
              cd "${GoSdkRunSetup.CHDIR_ENV}"
            fi
            if [ -n "${GoSdkRunSetup.SANDBOX_ROOT_ENV}" ]; then
              args=("${{@//__PANTS_SANDBOX_ROOT__/$sandbox_root}}")
              set -- "${{args[@]}}"
            fi
            exec "{goroot.path}/bin/go" "$@"
            """
        ).encode("utf-8"),
    )
    digest = await Get(Digest, CreateDigest([go_run_script]))
    return GoSdkRunSetup(digest, go_run_script)


@rule
async def setup_go_sdk_process(
    request: GoSdkProcess,
    go_sdk_run: GoSdkRunSetup,
    bash: BashBinary,
    golang_env_aware: GolangSubsystem.EnvironmentAware,
    goroot: GoRoot,
) -> Process:
    input_digest, env_vars = await MultiGet(
        Get(Digest, MergeDigests([go_sdk_run.digest, request.input_digest])),
        Get(
            EnvironmentVars,
            EnvironmentVarsRequest(golang_env_aware.env_vars_to_pass_to_subprocesses),
        ),
    )
    maybe_replace_sandbox_root_env = (
        {GoSdkRunSetup.SANDBOX_ROOT_ENV: "1"} if request.replace_sandbox_root_in_args else {}
    )
    return Process(
        argv=[bash.path, go_sdk_run.script.path, *request.command],
        env={
            **env_vars,
            **request.env,
            GoSdkRunSetup.CHDIR_ENV: request.working_dir or "",
            **maybe_replace_sandbox_root_env,
            # TODO: Maybe could just use MAJOR.MINOR for version part here?
            "__PANTS_GO_SDK_CACHE_KEY": f"{goroot.version}/{goroot.goos}/{goroot.goarch}",
        },
        input_digest=input_digest,
        description=request.description,
        output_files=request.output_files,
        output_directories=request.output_directories,
        level=LogLevel.DEBUG,
    )
