
Enable running multiple processes in the same sandbox. #17184

Open

tgolsson opened this issue Oct 10, 2022 · 11 comments

Comments

@tgolsson
Contributor

tgolsson commented Oct 10, 2022

Is your feature request related to a problem? Please describe.

To run an OCI image locally with runc, one needs to unpack the rootfs onto the local file system and then execute runc inside that filesystem. Since the unpacked filesystem contains symlinks and so on, Pants barfs when trying to capture it as the output of a process. Capturing it is also wasteful: the filesystem is quite large and only used during the run, so it makes more sense to fuse unpack + run into a single sandbox with no outputs.

However, there's no built-in abstraction for this in Pants right now.

Describe the solution you'd like

I'd like a built-in mechanism for running multiple processes with a fused input/output. I implemented a prototype to solve my immediate problem here:

https://github.com/tgolsson/pants-backends/blob/main/pants-plugins/oci/pants_backend_oci/tools/process.py
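
Roughly, the idea is that a rule can request several Processes as a single unit that shares one sandbox. A minimal sketch of the intended usage, based on the prototype above (the argvs and digests here are placeholders, and the exact FusedProcess fields may differ):

    # Sketch only: unpack the image and run it inside the same sandbox,
    # capturing no outputs. `unpack_argv`, `runc_argv`, and `bundle_digest`
    # are placeholders for this example.
    result = await Get(
        FallibleProcessResult,
        FusedProcess(
            (
                Process(
                    unpack_argv,
                    input_digest=bundle_digest,
                    description="Unpack OCI image into a runc bundle",
                ),
                Process(
                    runc_argv,
                    input_digest=bundle_digest,
                    description="Run the container with runc",
                ),
            )
        ),
    )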

Describe alternatives you've considered
I can keep my variant; it just seems like a useful building block for Pants to have "natively". As discussed in the Slack thread below, one could also run this as multiple separate processes when/if absolute symlinks are supported, but that would waste cache space.

Additional context
Discussion where I first hit this issue: https://pantsbuild.slack.com/archives/C01CQHVDMMW/p1665338251230869

@tgolsson
Contributor Author

(And IMO my approach shouldn't be upstreamed as-is, because the UX isn't great: it also has to merge the descriptions. OTOH, perfect is the enemy of good enough, or something like that.)

@thejcannon
Member

FWIW, symlink support is coming down the line (no ETA).

Additionally, for this use case we'd usually advocate just using a shell script or similar as your process.
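
For example (a rough sketch only, with placeholder commands and digests), the two steps could be written into a small script that's materialized with CreateDigest and run as a single Process:

    # Rough sketch: wrap both commands in one script so a single Process runs
    # them in one sandbox. Commands, paths, and the tool/image digests are
    # placeholders; `bash` comes from requesting BashBinary.
    script = FileContent(
        "__unpack_and_run.sh",
        b"set -euo pipefail\n"
        b"./umoci unpack --image image:latest bundle\n"
        b"./runc run --bundle bundle my-container\n",
    )
    script_digest = await Get(Digest, CreateDigest([script]))
    input_digest = await Get(Digest, MergeDigests([script_digest, tools_digest, image_digest]))
    process = Process(
        argv=[bash.path, script.path],
        input_digest=input_digest,
        description="Unpack OCI image and run it with runc",
    )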

@tgolsson
Contributor Author

I'm using a shell script that's auto-generated from a bunch of processes (I noticed the link had died and updated it). I think the downside is that it also requires merging the descriptions etc. And for this very specific application, it'd be wasteful to have separate sandboxes, since that would just duplicate the Docker image as separate files.

@thejcannon
Member

Is this just a performance issue, then? On the horizon, we're playing with symlinks for immutable (large, by some definition) inputs.

Otherwise, your request is tricky because Pants is written with support for remote execution, and AFAICT the remote execution API doesn't support this feature.

@thejcannon
Member

See #17282

@tgolsson
Contributor Author

tgolsson commented Oct 24, 2022

OK, I've had a think; there are three issues here from my perspective.

  • Symlinks do not work fully in sandboxes, requiring a workaround.
  • Even if symlinks worked, it'd be wasteful to copy the contents, since they're ephemeral and cannot be consumed by any other goal.
  • Even discounting the waste, there is a conceptual disconnect here: the first step is really only preparation for the actual work, which is running the container.

My workaround solves points 1 and 2, but the UX isn't optimal for 3 since I have to merge the descriptions, making it harder to understand what is being done and how long each step takes.

Same thing happens in my build step fwiw; I want to:

  • put X number of packages into the container
  • configure entrypoint and cmd
  • maybe set up environment
  • maybe tag as a layer
  • maybe compress/prune/whiteout/...

Each one is an invocation of umoci (roughly as sketched below), and no single step produces a "complete" item.
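
As a rough illustration (the umoci flags here are placeholders rather than verified syntax), each of those steps is its own Process along these lines:

    # Illustration only: one such step, setting the image's entrypoint.
    # Flags, paths, and `image_digest` are placeholders for this sketch.
    configure = Process(
        (umoci.exe, "config", "--image", "build/image:latest",
         "--config.entrypoint", "/app/server"),
        input_digest=image_digest,
        description="Set OCI image entrypoint",
        output_directories=("build",),
    )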

@benjyw
Contributor

benjyw commented Nov 25, 2022

cc @chrisjrn as this may relate to some of the work on experimental_shell_command

@chrisjrn
Contributor

I'm not completely sold on there being a need for this beyond writing a shell script and using experimental_shell_command -- we generally like for each target to represent a complete "build step" that is meaningfully cacheable and usable by multiple dependent targets. It sounds like your use case is for a set of sequential processes, where the intermediate targets are only used by the next target in a sequence. Is that right? Is the goal here to get better logging of each of the intermediate steps?
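
For reference, a minimal sketch of that route in a BUILD file (target and field names are from memory and may differ between Pants versions; the command and `:image_tar` dependency are placeholders):

    # Sketch of the shell-script route via experimental_shell_command.
    # Field names may vary by Pants version; the command is illustrative.
    experimental_shell_command(
        name="unpack_and_run",
        command="./umoci unpack --image image:latest bundle && ./runc run --bundle bundle c",
        tools=["umoci", "runc"],
        dependencies=[":image_tar"],
    )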

@tgolsson
Contributor Author

tgolsson commented Nov 28, 2022

I don't think experimental_shell_command is at all relevant for developing plugins, but maybe I'm wrong. Logging is an important part, but so is removing the boilerplate of taking what might be a few process invocations and building a script from them. If that is the intended way of doing multiple steps in one sandbox, can it be made easier for plugin developers (and more discoverable!) and clearer for end users (logging!)?

@tgolsson
Contributor Author

tgolsson commented Nov 28, 2022

I've had another think and wanted to elaborate a bit further, because I glossed over the question regarding my use case.

I think there are two distinct cases where this can be used: one where it is an anti-pattern, and one where it isn't. I think I'm incorrectly using this to build and configure an image in one go, like here:

        compile_result = await Get(
            FallibleProcessResult,
            FusedProcess(
                (
                    Process(
                        (
                            umoci.exe,
                            "raw",
                            "add-layer",
                            ...
                            "layers/image_bundle.tar",
                        ),
                        input_digest=command_digest,
                        description=f"Package OCI Image Bundle: {request.target.address}",
                    ),
                    Process(
                        (
                            umoci.exe,
                            "config",
                            ...
                        ),
                        input_digest=command_digest,
                        description=f"Configure OCI Image: {request.target.address}",
                        output_directories=("build",),
                    ),
                ),
            ),
        )

Here I'm adding a layer to an image (process 1) and then configuring that "atomically" (process 2). This means that if I change the environment I'll have to rebuild the contents of that layer. Bad! I can split that into two distinct steps and get much faster configure times, at the cost of extra complexity and potentially increased cache use. That's an important consideration, but I think my current choice is suboptimal.
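
A sketch of how that split might look (two independently cached steps, where the second reuses the first's output digest so only the configure step reruns when the config changes; the elided args are as in the snippet above):

    # Sketch: split the fused process into two separately cached steps.
    layer_result = await Get(
        ProcessResult,
        Process(
            (
                umoci.exe,
                "raw",
                "add-layer",
                ...,
                "layers/image_bundle.tar",
            ),
            input_digest=command_digest,
            description=f"Package OCI Image Bundle: {request.target.address}",
            output_directories=("build",),
        ),
    )
    configure_input = await Get(
        Digest, MergeDigests([command_digest, layer_result.output_digest])
    )
    config_result = await Get(
        ProcessResult,
        Process(
            (
                umoci.exe,
                "config",
                ...,
            ),
            input_digest=configure_input,
            description=f"Configure OCI Image: {request.target.address}",
            output_directories=("build",),
        ),
    )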

So, case two:

    return await Get(
        Process,
        FusedProcess(
            (
                packed_image_process,
                mkdir_process,
                Process(
                    command,
                    description=f"Running {request.target}",
                    input_digest=tool.digest,
                ),
            )
        ),
    )

Here I would say I'm using it properly. This is the case where the symlinks break if I cache it, but fundamentally the "unpacked" image is a dead end build-wise. Caching the unpacked image is going to waste space for almost no benefit, and the only thing that can be done with it is to run it. Maybe we could run it with different args and keep it unpacked, but I don't see much benefit to that. (Let's ignore my usage of a mkdir_process, which should clearly be a CreateDigest, I've learned. ;-))
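
(A minimal sketch of that CreateDigest replacement, with a placeholder directory name:)

    # Create an empty directory in the sandbox up front instead of running mkdir.
    # "bundle" is a placeholder for whatever directory the run needs.
    empty_dir_digest = await Get(Digest, CreateDigest([Directory("bundle")]))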

So maybe what I'm thinking about is that there are 1..N steps that sometimes have to happen to "get ready" and "finish up", just like I make coffee before I start working in the morning and put away my cup at the end of the day. The coffee is just an aside to the coding goal; it's only important to me.

@Eric-Arellano
Contributor

One workaround is to create a bash script that invokes both processes, then invoke that bash script as a single Process. For example, we use a bash script to invoke Go processes:

@rule
async def go_sdk_invoke_setup(goroot: GoRoot) -> GoSdkRunSetup:
    # Note: The `go` tool requires GOPATH to be an absolute path which can only be resolved
    # from within the execution sandbox. Thus, this code uses a bash script to be able to resolve
    # absolute paths inside the sandbox.
    go_run_script = FileContent(
        "__run_go.sh",
        textwrap.dedent(
            f"""\
            export GOROOT={goroot.path}
            sandbox_root="$(/bin/pwd)"
            export GOPATH="${{sandbox_root}}/gopath"
            export GOCACHE="${{sandbox_root}}/cache"
            /bin/mkdir -p "$GOPATH" "$GOCACHE"
            if [ -n "${GoSdkRunSetup.CHDIR_ENV}" ]; then
              cd "${GoSdkRunSetup.CHDIR_ENV}"
            fi
            if [ -n "${GoSdkRunSetup.SANDBOX_ROOT_ENV}" ]; then
              args=("${{@//__PANTS_SANDBOX_ROOT__/$sandbox_root}}")
              set -- "${{args[@]}}"
            fi
            exec "{goroot.path}/bin/go" "$@"
            """
        ).encode("utf-8"),
    )
    digest = await Get(Digest, CreateDigest([go_run_script]))
    return GoSdkRunSetup(digest, go_run_script)


@rule
async def setup_go_sdk_process(
    request: GoSdkProcess,
    go_sdk_run: GoSdkRunSetup,
    bash: BashBinary,
    golang_env_aware: GolangSubsystem.EnvironmentAware,
    goroot: GoRoot,
) -> Process:
    input_digest, env_vars = await MultiGet(
        Get(Digest, MergeDigests([go_sdk_run.digest, request.input_digest])),
        Get(
            EnvironmentVars,
            EnvironmentVarsRequest(golang_env_aware.env_vars_to_pass_to_subprocesses),
        ),
    )
    maybe_replace_sandbox_root_env = (
        {GoSdkRunSetup.SANDBOX_ROOT_ENV: "1"} if request.replace_sandbox_root_in_args else {}
    )
    return Process(
        argv=[bash.path, go_sdk_run.script.path, *request.command],
        env={
            **env_vars,
            **request.env,
            GoSdkRunSetup.CHDIR_ENV: request.working_dir or "",
            **maybe_replace_sandbox_root_env,
            # TODO: Maybe could just use MAJOR.MINOR for version part here?
            "__PANTS_GO_SDK_CACHE_KEY": f"{goroot.version}/{goroot.goos}/{goroot.goarch}",
        },
        input_digest=input_digest,
        description=request.description,
        output_files=request.output_files,
        output_directories=request.output_directories,
        level=LogLevel.DEBUG,
    )
