Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental_shell_command's sandbox and invalidation includes transitive dependencies of experimental_shell_command dependencies, not just outputs #17345

Closed
huonw opened this issue Oct 26, 2022 · 4 comments · Fixed by #17743
Assignees
Labels

Comments

@huonw
Copy link
Contributor

huonw commented Oct 26, 2022

Describe the bug

When a experimental_shell_command (:second) depends on another experimental_shell_command (:first), the sandbox for :second include the dependencies of :first, rather than just the output of :first.

https://gist.github.com/huonw/c55d0b387ed6030cda611898ee2d0361 provides a reproducer, where :first generates output.txt by 'consuming' input.txt (but does nothing with it), and :second includes a sleep to demonstrate when it is running. The archive package includes all the .txt files that ended in the sandbox for :second: I'd expect it to only be the direct output of :first (output.txt), not the input.txt dependency of :first.

BUILD file from that gist for convenience:

file(name="input", source="input.txt")
experimental_shell_command(
    name="first",
    command="""
    echo "doesn't affect output"
    echo contents > output.txt
    """,
    tools=["echo"],
    dependencies=[":input"],
    outputs=["output.txt"]
)

experimental_shell_command(
    name="second",
    command="""
    sleep 3 # make 'actually running' obvious
    """,
    dependencies=[":first"],
    tools=["sleep"],
    outputs=["*.txt"],
)
archive(name="archive", files=[":second"], format="zip")
git clone git@gist.github.com:c55d0b387ed6030cda611898ee2d0361.git
cd c55d0b387ed6030cda611898ee2d0361
./pants version # 2.13.0

# initial build (takes about 3 seconds to run :second, due to sleep)
./pants package ::

# check package output:
unzip -l dist/archive.zip # two files: input.txt output.txt

# validating the cache works as expected (~instantaneous)
./pants package ::

# BUG: change to :first dependencies, that doesn't affect output... 
echo 'new contents' > input.txt
# ... reruns :second, and thus takes ~3 seconds
./pants package ::

# EXPECTED: change to :first script that doesn't affect output...
sed -i '' 's/affect output/affect output still/' BUILD
# .... doesn't rerun :second, only :first
./pants package :: 

AFAICT, there's no way to break :second's importing of :input, e.g. adjusting to dependencies=[":first", "!:input"] doesn't do anything: the behaviour is the same (and using !!:input isn't supported).

The :input being file is just for convenience. This appears to apply to all other target types, e.g. a pex_binary or yet another experimental_shell_command.

I haven't checked how this behaves when depending on explicit codegen targets, only these adhoc experimental_shell_command ones.

Pants version
2.13.0

OS
macOS

Additional info

Background: how are we using experimental_shell_command to hit this?

We're using experimental_shell_command to try to bridge the gap between Pants supported code and unsupported code (JS/TS), as well as for adhoc codegen tasks (to avoid avoid having to write and maintain a plugin). This can result in 'long' chains of experimental_shell_commands that depend on each other (and other resources), e.g.:

# some app:
python_sources()
pex_binary(name="app", ...)

# codegen the schema itself:
experimental_shell_command(name="schema", dependencies=[":app"], outputs=["schema.json"], command="./app.pex export-schema > schema.json", tools=["python3.9"])

# set up the NPM dependencies:
experimental_shell_command(name="node_modules", outputs=["node_modules/**"], command="npm ci", ...)

# use the schema to do codegen or generate docs or whatever:
experimental_shell_command(name="codegen-from-schema", dependencies=[":node_modules", ":schema"], outputs=["codegen/**"], command="npm run do-codegen < schema.json", ...)
experimental_shell_command(name="docs-from-schema", dependencies=[":node_modules", ":schema"], outputs=["docs/**"], command="npm run generate-docs < schema.json", ...)

# continues with targets using :codegen-from-schema and :docs-from-schema... (`archive`, `experimental_shell_command` and `experimental_run_shell_command`)

The Python code that goes into pex_binary changes regularly and thus the PEX itself changes too, but the exported schema.json doesn't change so much (i.e. we're often refactoring/adding features/fixing bugs without changing the schema). In theory, if :app changes but the schema doesn't, :schema's dependees (:codegen-from-schema, :docs-from-schema) don't need to run, but those dependees pull in the app.pex file from :app, and thus do rerun. Rerunning can take significant time.

(See also: https://pantsbuild.slack.com/archives/C046T6T9U/p1666315386420799)

@huonw huonw added the bug label Oct 26, 2022
@benjyw benjyw self-assigned this Oct 26, 2022
@benjyw
Copy link
Contributor

benjyw commented Oct 27, 2022

Started a discussion here: #17368

@thejcannon
Copy link
Member

FWIW This is part of my "re-architect dependencies" proposal I'm hoping to work on late this year or Q1 next year 😄

@chrisjrn
Copy link
Contributor

#17680 and follow-up work will have some impact on this

@chrisjrn
Copy link
Contributor

chrisjrn commented Dec 7, 2022

@huonw Would love your input on #17743 -- I think this should solve your problem somewhat.

chrisjrn pushed a commit to chrisjrn/pants that referenced this issue Dec 7, 2022
…_command`

Replaces the `outputs` field, which has a usability issue demonstrated in pantsbuild#17345
chrisjrn pushed a commit that referenced this issue Dec 9, 2022
…tal_shell_command` (#17744)

Replaces the `outputs` field, which has a usability issue demonstrated in #17345.

Extraculated from #17743.
chrisjrn pushed a commit that referenced this issue Dec 9, 2022
As seen in #17345, `experimental_shell_command` has a confused notion of dependencies. There are the "runtime dependencies": those needed to successfully run the command (essentially input data or runnable binaries), and then there are the dependencies that are necessary to use the side-effects generated by the command. `esc` currently treats these identically, which means that the sandboxes for chains of commands get overly large, and their cache invalidations overly fine (due to so many dependencies).

This adds a `SpecialCasedDependencies` field to `experimental_shell_command`, called `execution_dependencies`. `execution_dependencies` are used to populate the sandbox, but are not fetched when `TransitiveTargetsRequests` are made by subsequent targets; `dependencies` is now deprecated and replaced with `output_dependencies`, which are the dependencies necessary for successfully consuming the outputs of the `experimental_shell_command`.

To make it easier to migrate, if `dependencies` are specified _without_ declaring `execution_dependencies`, a deprecation warning will be raised explaining how to upgrade.

Closes #17345
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants