Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support running pex_binary targets directly on the in-repo sources. #15689

Merged
merged 4 commits into from
May 31, 2022

Conversation

benjyw
Copy link
Contributor

@benjyw benjyw commented May 27, 2022

Some processes, such as Django's makemigrations use the __file__ and/or __path__ of modules to generate new paths to write to, relative to those. If run against sources in the tmpdir, those files will be written into the tmpdir and then cleaned up.

Running inline is necessary in some cases, and whether it is necessary or not is a property of the binary (does it need to know the true locations of the source files it runs on), and so the inline-ness is a property of a pex_binary target rather than a command-line option.

The consequences of running inline are that more code will be accessible on the sys.path - basically all the code in the relevant source roots. Theoretically this could lead to different behavior than running sandboxed. But this is unlikely to be a problem in practice, especially for the types of binary we're talking about. They already implicitly expect to run with that context, so if there is a difference in behavior then arguably the sandboxed behavior is the wrong one.

Note that this concept of running "inline" is Python-specific, not generic to the run goal. There is no meaning to running inline for a compiled language.

[ci skip-rust]

[ci skip-build-wheels]

@benjyw
Copy link
Contributor Author

benjyw commented May 27, 2022

Fixes #12129 for run. A separate change can fix for repl.

@benjyw benjyw requested a review from stuhood May 28, 2022 23:26
[ci skip-rust]

[ci skip-build-wheels]
@thejcannon
Copy link
Member

What's the chances this makes it in a 2.12 release? I want to upgrade us and this would be HUGE for migrations

@@ -548,6 +548,25 @@ class PexIncludeToolsField(BoolField):
)


class RunInlineField(BoolField):
alias = "run_inline"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatives:

  • Call it run_in_repo (or similar) to capture intent. (Instead of introducing a new word: "inline")
  • Flip it around. I expect most users assume run runs the code in-repo, so make this field align with expectations. run_in_sandbox defaulting to True. Then the flag becomes --no-run-in-sandbox.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like run_in_sandbox

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is no flag, this is a field on a target

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes. Good point. Might be worth pontificating a flag akin to skip being a flag and a field.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this concept of running "inline" is Python-specific, not generic to the run goal.

This is specific to any run-as-is (non-compiled, uncompiled?) language. So languages like Ruby, JS, or Shell would all be valid.

[ci skip-rust]

[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented May 31, 2022

What's the chances this makes it in a 2.12 release? I want to upgrade us and this would be HUGE for migrations

Could contemplate backporting, although I prefer not to for features. But this one seems relatively harmless since you have to turn it on in the target.

@thejcannon
Copy link
Member

I suppose an alternative we maybe haven't considered (and likely won't in the very short term) is that we're somewhat overusing pex_binary here for run and for package, when in reality they are two very different use-cases and exhibit very different behavior.

A user might rightfully expect run of a packageable to be essentially a package followed by executing the package.

In that regard the target here might be python_script, which is only runnable and not packageable.

@thejcannon
Copy link
Member

Could contemplate backporting, although I prefer not to for features. But this one seems relatively harmless since you have to turn it on in the target.

Scratch that. I opened the PR and realized that:

  • The relevant change to make this work is in the run goal support for pex_binary
  • The goal code is minimal

In order to not bully features here (or have to wait for migrations), and somewhat tangentially related to the above I think I'm going to experiment with a python_script target as an in-repo plugin in our codebase.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented May 31, 2022

I suppose an alternative we maybe haven't considered (and likely won't in the very short term) is that we're somewhat overusing pex_binary here for run and for package, when in reality they are two very different use-cases and exhibit very different behavior.

A user might rightfully expect run of a packageable to be essentially a package followed by executing the package.

In that regard the target here might be python_script, which is only runnable and not packageable.

I'm not sure about that. If we consider targets to be descriptive then in both cases this is "python code with an entry point". Think about your arguments for the prosecution in the case of Files vs. Resources, that targets should describe a thing, not how the thing is used.

@thejcannon
Copy link
Member

I suppose an alternative we maybe haven't considered (and likely won't in the very short term) is that we're somewhat overusing pex_binary here for run and for package, when in reality they are two very different use-cases and exhibit very different behavior.
A user might rightfully expect run of a packageable to be essentially a package followed by executing the package.
In that regard the target here might be python_script, which is only runnable and not packageable.

I'm not sure about that. If we consider targets to be descriptive then in both cases this is "python code with an entry point". Think about your arguments for the prosecution in the case of Files vs. Resources, that targets should describe a thing, not how the thing is used.

Very Fair. Point taken. In that regard perhaps pex_binary is itself a misnomer and an alias for python_script. In this instance pex is really only an implementation detail of the requirements (not 100% true, but mostly), and is less-but-still-technically an implementation detail for package.

To drive that home, too, fields like include_requirements, include_tools, and layout are meaningless to this target if only ever run. Really you just need dependencies and entry_point. Then (and I'm going somewhat off the rails here, but keep your imagination open) maybe a package_opts field set to an instance of a PexOptions (a la HTTPSource). It allows for not bloating the target type on details of PEX it will never use, and also allows for other package options to share the python_script target type (PyInstallerOptions or PyOxyOptions).

@benjyw
Copy link
Contributor Author

benjyw commented May 31, 2022

Yeah, I think it's pex_binary because package turns it into a pex, and if there were some other format, it would have a separate target, which would itself also be runnable

@thejcannon
Copy link
Member

Actually... We're running first-class sources, right? In this case $ ./pants run path/to/manage.py makemigrations

.... What would it look like to have ./pants run path/to/manage.py run via a FieldSet containing PythonSourceField? I'm gonna try it 😂

@benjyw benjyw merged commit 981ebcc into pantsbuild:main May 31, 2022
@benjyw benjyw deleted the run_on_original_sources branch May 31, 2022 22:07
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Comment on lines +584 to +585
is an example of such a process. It may also have lower latency, since no files need
to be copied into a chroot.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last sentence isn't true anymore, is it? We always copy files.

@@ -568,6 +568,25 @@ class PexIncludeToolsField(BoolField):
)


class RunInSandboxField(BoolField):
alias = "run_in_sandbox"
default = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this default to False? It increases the likelihood Pants Just Works, vs making you discover a niche field.

@thejcannon
Copy link
Member

Actually... We're running first-class sources, right? In this case $ ./pants run path/to/manage.py makemigrations

.... What would it look like to have ./pants run path/to/manage.py run via a FieldSet containing PythonSourceField? I'm gonna try it joy

This works like a charm with, maybe a one-line(-ish) change. I don't like the implication 😈

# complexity of figuring out here which sources were codegenned, we copy everything.
# The inline source roots precede the chrooted ones in PEX_EXTRA_SYS_PATH, so the inline
# sources will take precedence and their copies in the chroot will be ignored.
local_dists.remaining_sources.source_files.snapshot.digest,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had this concern earlier, but couldn't substantiate it. Now I can.

If we're generating sources as submodules of a parent module, Python will use the parent's path for submodule lookup. So the in-repo path is the only path considered and it fails to import the codegened module 😢

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "fix" is to opt-out of the in-repo running, so everything is in the sandbox. Not great but not fatal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://docs.python.org/3/reference/import.html

The meta path may be traversed multiple times for a single import request. For example, assuming none of the modules involved has already been cached, importing foo.bar.baz will first perform a top level import, calling mpf.find_spec("foo", None, None) on each meta path finder (mpf). After foo has been imported, foo.bar will be imported by traversing the meta path a second time, calling mpf.find_spec("foo.bar", foo.__path__, None). Once foo.bar has been imported, the final traversal will call mpf.find_spec("foo.bar.baz", foo.bar.__path__, None).

So in my case the submodule is at foo.bar.baz, but the mpf.find_spec("foo", None, None) gives a valid spec for an in-repo module, so it assumes the in-repo tree is where we'll find foo.bar.baz. I can maybe try to play around with namespace packages, but that seems hacky and brittle.

Unfortunately I don't have a solution right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can maybe try to play around with namespace packages

Blasting every __init__.py up the chain with https://packaging.python.org/en/latest/guides/packaging-namespace-packages/#pkgutil-style-namespace-packages worked.

Tricky Tricky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the only sane way to solve this is to have the relevant packages be namespace packages. This can be a caveat we document, but since codegen use is not super-common, and needing to run inline is not super-common, I think that's OK.

Copy link
Member

@stuhood stuhood Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the only sane way to solve this is to have the relevant packages be namespace packages. This can be a caveat we document, but since codegen use is not super-common, and needing to run inline is not super-common, I think that's OK.

Not quite... the reason this works in the sandbox is because we merge colliding PYTHONPATH entries into a single entry. It would be possible to do something similar in this mode, by using exclusively a sandbox entry if not doing so would result in packages being split across PYTHONPATH entries.

Whether that qualifies as "sane" is an open question, but.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not qualify as sane IMO. The whole point of this change is to allow you to force something to run in the repo if it needs to. Things do not generally need to. I don't think we need to switch the default here, for example. But if you do need to we shouldn't silently not do what you asked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the proposal is to change the default behavior then that would have to be strongly motivated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very of the mind that this PR only fills the gap of not being able to ./pants run random_file.py in-repo. To me, running the `pex_binary`` actually runs the PEX binary (and there is no "sandbox" vs in-repo toggle).

That being said, this particular issue will travel to the ./pants run random_file.py implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants