Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build and install python_distribution deps for test, run, and repl #12573

Merged
merged 14 commits into from
Sep 16, 2021

Conversation

benjyw
Copy link
Contributor

@benjyw benjyw commented Aug 15, 2021

If you (transitively) depend on a python_distribution, Pants will now build and
install it into the venv for test, run and repl.

We don't do anything special when packaging, under the assumption that the
python_distribution will itself be packaged and depended on as a published artifact.

This change allows you to seamlessly use existing/custom setup.py - notably to build Python
native extensions - in Pants. If a setup.py can build it, Pants can consume it.

Note that adding this support was very little code, thanks to the engine. Most of this
change is tests.

This change also tweaks the testprojects native extensions example, to clarify what
is going on.

Addresses #11942

[ci skip-rust]

[ci skip-build-wheels]

@benjyw
Copy link
Contributor Author

benjyw commented Aug 15, 2021

Note that this reveals a slight awkwardness with runtime_package_dependencies. The runtime_package_dependencies are only for tests, and are just packaged and put into the chroot, for the test to run. That continues to makes sense for pex binaries, that the test can invoke, but less so for dists - what can a test do with a .whl file? With this change you can simply depend on a python_distribution and the natural thing will happen.

We may want to rethink runtime_package_dependencies, but that is future work.

[ci skip-rust]

[ci skip-build-wheels]
Copy link
Contributor

@jsirois jsirois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure the sys.path duplication is a fatal problem for things as they stand unless dependening on a python_distribution does not actually cause a dependency on the python_distribution's dependencies in turn.

@@ -46,6 +68,7 @@ async def create_python_repl_request(repl: PythonRepl, pex_env: PexEnvironment)
extra_env = {
**complete_pex_env.environment_dict(python_configured=requirements_pex.python is not None),
"PEX_EXTRA_SYS_PATH": ":".join(chrooted_source_roots),
"PEX_PATH": repl.in_chroot(local_dists.pex.name),
Copy link
Contributor

@jsirois jsirois Aug 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think you now depend on the local_dists twice. Once up in PEX_EXTRA_SYS_PATH above as loose sources (with no .dist-info/ metadata dir) and then here as an installed wheel with a .dist-info/ metadata dir. I'm not sure which is 1st on the final sys.path and the order should not matter anyhow since there will be just one proj-rev.dist-info/ dir on the sys.path, but ... this duplication of everything else feels like we're walking into a trap. For example, although the .dist-info/ case works out solving one user's issue, I'm pretty sure a native extension that lives in a package may not. For example, if there is pure python code packaged as package/module.py and package/module.py does not live in a maespace package and the loose source version of this file ends up 1st on the sys.path, then the package/native.cpython-39-x86_64-linux-gnu.so located in an installed wheel on the PEX_PATH will never be found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah:

$ cat <<EOF > BUILD.test
python_library(
  name="jake",
  dependencies=[
    "src/python/pants:pants-packaged",
  ],
)
EOF
^jsirois@gill ~/dev/pantsbuild/pants (main) $ ./pants filedeps --transitive //:jake
3rdparty/python/BUILD
3rdparty/python/requirements.txt
BUILD.test
src/python/pants/BUILD
src/python/pants/VERSION
src/python/pants/__main__.py
src/python/pants/backend/awslambda/python/BUILD
src/python/pants/backend/awslambda/python/__init__.py
src/python/pants/backend/awslambda/python/lambdex.py
src/python/pants/backend/awslambda/python/register.py
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct, and in fact occurred during my testing. The loose sources win, and make the native code unimportable if not in a namespace package.

The duplication itself can be mitigated using the !! operator to exclude deps. And we can mitigate it in the future by A) not attempting to infer deps that can be satisfied by a local dist and B) not following explicit deps through a python_distribution target when computing the transitive closure.

But the namespace package problem may occur regardless, even if we eliminate duplication, because of code outside the dist sharing a package with code in the dist. So depending on a dist like this does require you to know what you're doing.

That said, as this stands, it solves a real user problem, albeit with a gotcha, which we can document. So, on balance, we might want to release this and potentially deal with the missing pieces as above in a followup. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no well justified thoughts. It seems to me subtracting the closure of a python_distribution dep sub graph from the larger graph in play is not actually hard, in which case it would be great to just take a day or a week to do things right and ship in the next dev release. I may be underestimating the difficulty of doing this right from the get go though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure how hard it would be. But it doesn't fully solve the namespace problem. In most/many cases you will still have to ensure the code is in a namespace package, at which point the duplication problem seems less critical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"inner" or "code" targets that own files (python_library, python_tests, files, resources), and then "artifact" targets (python_distribution, pex_binary) that don't own files but instead represent an artifact to be built. We don't model this distinction explicitly, but we probably should. Maybe we should even use different terms for them.

This is precisely why I have been so excited about that cleanup a few months ago for python_awslambda and pex_binary to no longer have sources. This is the mental model I've been using since that breakthrough, including when envisioning the elusive "re-envisioning targets" proposal. I do think there's a big difference between those two types of targets, and it is valuable to codify it more.

Still handwavy, but imo: targets work really well for artifact-type targets like pex_binary. They also seem to work well for 3rd-party deps metadata. They do not work well for first-party code, with issues like it being too hard to update just 1 file's metadata. So, I could envision us codifying this distinction by replacing 1st-party targets w/ something else entirely, for example.

--

about now having to keep track of all the different unwritten semantics.

@benjyw it sounds like you are envisioning a particular target always behaving the same way when in dependencies, and not depending on the call site's custom semantics? (Outside of project introspection). So, if you depend on a python_distribution, that always means to install it?

That consistency would be good imo and make me more comfortable with this.

If so, where would that be documented? On the python_distribution target itself? In the dependencies field help message for each consumer? (I'm asking because I think it's important we know how to explain this to users.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By having special-cased fields, you could use the same python_distribution to both a) install as a local dist you can import, b) inspect the .whl file, and c) use in some way we can't anticipate that plugin authors write. Or, pick and choose from those 3.

To clarify something: both of my suggestions involved using the SpecialCasedDependencies field, but they proposed using them in different places. Would suggest taking another look there.

The current implementation allows for either "consumer decides" or "producer decides", depending on where SpecialCasedDependencies are used.

@benjyw it sounds like you are envisioning a particular target always behaving the same way when in dependencies, and not depending on the call site's custom semantics? (Outside of project introspection). So, if you depend on a python_distribution, that always means to install it?

This is also what I felt would likely be least redundant:

I think that given that we have the python_distribution target which already "marks" a particular intended behavior, option 2 would be redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I agree that we always want the same behavior when you depend on a python_distribution, and I think that behavior fairly obviously should be "build the dist and install it in the downstream environment". Having a dependency mean "depend on the python_library targets that a python_distribution depends on", seems weird. And if someone really wants to depend on those python_library targets, they can just depend on them directly.

In basically 100% of cases today, depending on a python_distribution is a smell, and we have masked the smell by giving that dependency a not-particularly-useful meaning. And I doubt there are many such dependencies today, since it doesn't really make sense, especially with dep inference. What I'm proposing is to have that dependency mean something more intuitive.

That said, again, dep inference really muddies the water...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think you now depend on the local_dists twice. Once up in PEX_EXTRA_SYS_PATH above as loose sources (with no .dist-info/ metadata dir) and then here as an installed wheel with a .dist-info/ metadata dir.

This still seems like a blocking issue. It seems like you need to introduce SpecialCasedDependencies for the distribution target.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that breaks other things.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented Aug 20, 2021

So, how to make progress here? What pre- or post-work would make folks comfortable with merging this?

[ci skip-rust]

[ci skip-build-wheels]
[ci skip-rust]

[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented Aug 25, 2021

Hi reviewers! I'd like to move towards merging this. There appear to be two contentious issues:

  1. What should depending on a python_distribution mean? Today it means nothing, or, rather, it effectively means "depend on the dependencies of the python_distribution. You could replace the python_distribution with a vanilla target and get the same behavior. So almost certainly no one is actually purposely doing this. With this change, depending on a python_distribution will mean something that I believe is sensible and intuitive.

  2. This introduces trickiness wrt sources appearing twice on the syspath, once via the built dist, and once via source deps (especially inferred ones). This means that the packages in question must be namespace packages. This can be a known and documented gotcha, until we can mitigate it. The mitigation (subtracting deps that are satisfied by the dist from the sources, and/or using SpecialCasedDependencies in some way) may be simple, or there may be complications, but either way I'd prefer to do it in a followup.

This solves a real user's real problem, and likely doesn't affect other users who don't need this feature.

So - can I get a review with an eye to merging? Or do you object to this plan? Thanks.

@Eric-Arellano
Copy link
Contributor

  1. What should depending on a python_distribution mean?

I'm now okay with this proposal because you are planning for depending on a python_distribution to always behave the same way. That consistency is key for me.

I will feel even more comfortable with this if we document the behavior explicitly. Probably in the help for python_distribution? Can you think of other places (given that duplication is good in docs)?

--

Should typecheck and Pylint probably also use this new behavior?

--

  1. This introduces trickiness wrt sources appearing twice on the syspath

FYI I think the fix is to change the dependencies field for python_distribution to subclass SpecialCasedDependencies rather than Dependencies. This means that when a python_distribution is included in a TransitiveTargetsRequest etc, its depenencies will not be included by default - that's what we want. But, it will still be included when running ./pants dependencies / project introspection.

It does become a little awkward to consume the dependencies for the python_distribution code in production code, though. Naively, you could do TransitiveTargetsRequest([python_dist.address], include_special_cased_dependencies=True) to get the deps of the python_dist. But, that will result in pulling in the special cased dependencies for other target types that we may not want, e.g. pulling in the original files targets that a relocated_files target depends on - both the files and relocated_files targets would be put into the dist, which is wrong. Instead..you'd do this to get the direct deps of the python_dist, and then you can convert to transitive targets from there

await Get(UnparsedPythonAddresses, python_dist[PythonDistDependencies].to_unparsed_address_inputs())

I do think this should be prework, and it makes sense to have a dedicated entry in the changelog: the API change is that depending on a python_distribution will no longer pull in all of its transitive deps. That should be explicitly marked in the changelog.

I know this all probably sounds Greek - feel free to delegate to me as prework for landing this.

[ci skip-rust]

[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented Aug 30, 2021

Interesting. This kind of problem keeps popping up - @kaos's Docker work has a similar issue - depending on a docker_image target means depending on the built image (and consuming it in the FROM of another docker_image's Dockerfile), not on that target's dependencies.

So I wonder if we want to bite the bullet and finally make a formal distinction between "edge targets" and "library targets". I will chew on this.

@kaos
Copy link
Member

kaos commented Aug 31, 2021

Getting to know more about special cased deps just now, I'm thinking if it wouldn't be useful to be able to get deps transitively with a exclude/include lists of dep fields, a la source type fields arg to a sources req. That way you have finer level of control to slice and dice the deps you get.

@benjyw
Copy link
Contributor Author

benjyw commented Sep 2, 2021

So, "make binary targets' dependencies field be a SpecialCasedDependencies instead of a Dependencies" is non-trivial.

Probably we want to re-think SpecialCasedDependencies which has a vague name and vague behavior, and is not a subclass of Dependencies despite its name. Instead I think we probably want just Dependencies, and that a target can only have one such field. If we do need fine-grained control over how to handle a specific dependency (such as build it vs. pull in its sources) we can use @jsirois's suggestion of an @ annotation suffix on the dependency address. But this change might require deprecation cycles and so on.

All that said, since this pre-work has become a yak shave, I would like to go ahead and merge this, since it solves a real user problem (with a gotcha that we can mitigate with the above-mentioned work). Letting it bit-rot while we slowly finesse the Dependencies issue doesn't seem great.

Thoughts?

@kaos
Copy link
Member

kaos commented Sep 2, 2021

Makes sense to me, at least ;)

I like @jsirois suggestion with @ annotations too. It opens up for a great deal of flexibility, if we allow adding custom annotations in plugins, but all that is for another issue.

@Eric-Arellano
Copy link
Contributor

What happens with this PR if a python_distribution depends on a python_distribution? (Sorry, don't have time rn to test that out myself.)

@stuhood
Copy link
Member

stuhood commented Sep 2, 2021

Probably we want to re-think SpecialCasedDependencies which has a vague name and vague behavior, and is not a subclass of Dependencies despite its name. Instead I think we probably want just Dependencies, and that a target can only have one such field. If we do need fine-grained control over how to handle a specific dependency (such as build it vs. pull in its sources) we can use @jsirois's suggestion of an @ annotation suffix on the dependency address. But this change might require deprecation cycles and so on.

Alternatively, allow people to filter the dependencies that they request by Dependencies field subclass type, and have the default not include the kinds of things that SCD currently filters.

If we do need fine-grained control over how to handle a specific dependency (such as build it vs. pull in its sources) we can use @jsirois's suggestion of an @ annotation suffix on the dependency address. But this change might require deprecation cycles and so on.

I do think that "consumer decides" works in some cases, but there are plenty of others where that would just seem like a gotcha. The only reason to depend on a python_distribution rather than the python_library behind it would be because you wanted the distribution: the target type already sends the signal of what it should be.

@Eric-Arellano
Copy link
Contributor

I do think that "consumer decides" works in some cases, but there are plenty of others where that would just seem like a gotcha. The only reason to depend on a python_distribution rather than the python_library behind it would be because you wanted the distribution: the target type already sends the signal of what it should be.

Yeah, agreed. @benjyw has convinced me we want depending on a target type to always mean the same thing. I agree. It will be important to document that behavior somehow, so people know "depending on a python_distribution always results in building it and inserting into chroot - unless it's project introspection."

[ci skip-rust]

[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented Sep 10, 2021

So, @Eric-Arellano attempted to implemented his proposed suggestion as pre-work, and discovered, as I did when I also attempted something similar in #12709, that it's more complex than it seems, and requires some thinking.

Meanwhile, as I keep saying, this change solves a real user problem, and is slowly bitrotting. So please review with an eye to merging, with the dependency semantics issue (which is wider than just this change) to be addressed later, or argue against merging this. Thanks!

@stuhood
Copy link
Member

stuhood commented Sep 10, 2021

So, @Eric-Arellano attempted to implemented his proposed suggestion as pre-work, and discovered, as I did when I also attempted something similar in #12709, that it's more complex than it seems, and requires some thinking.

What was the conclusion? I don't see the blocking issue described on #12709.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
[ci skip-rust]

[ci skip-build-wheels]
@benjyw
Copy link
Contributor Author

benjyw commented Sep 12, 2021

Just remembered/realized that even the proposed pre-work of changing the python_distribution target's dependencies field to be a SpecialCasedDependencies won't solve the "sources appearing twice on the syspath" problem, because dep inference will bring in at least some of those dependencies regardless: Presumably if some code is consuming a dist, it's importing symbols from that dist, and dep inference will obliviously do its thing. The fix to that might be for dep inference to infer deps on the dist instead of on the source, but that is complicated - we now have to treat sources inside the dist differently from sources outside the dist that consume it.

So I'm back to thinking that the pre-work is not necessary for the purpose of this change, as it doesn't actually solve the problem. We do need to re-think how we handle dependencies in these various cases, but I'm back to thinking that it doesn't block this change.

@benjyw
Copy link
Contributor Author

benjyw commented Sep 12, 2021

Probably the right solution is something @jsirois suggested - simply subtracting the sources in the dist from the loose sources.

@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Sep 12, 2021

Probably the right solution is something @jsirois suggested - simply subtracting the sources in the dist from the loose sources.

Okay I think that makes sense. What about third-party dependencies? Edit: I think you wouldn't have sys.path issues, but would install deps twice? When installing the dist and also as normal 3rd-party requirements.

@benjyw
Copy link
Contributor Author

benjyw commented Sep 12, 2021

Probably the right solution is something @jsirois suggested - simply subtracting the sources in the dist from the loose sources.

Okay I think that makes sense. What about third-party dependencies? Edit: I think you wouldn't have sys.path issues, but would install deps twice? When installing the dist and also as normal 3rd-party requirements.

That's a pip thing - you're trying to pip-install two dists with the same project name, so pip will pick one. But in any case you shouldn't be trying to do that in the first place.

@benjyw
Copy link
Contributor Author

benjyw commented Sep 12, 2021

OK, latest commit implements that subtraction. It can be reviewed independently if you're familiar with the rest of this change,

[ci skip-rust]

[ci skip-build-wheels]
Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! How about updating the PR title so this better captures the breaking API change?

Depending on python_distribution now builds and installs it for test, run, and repl

[ci skip-rust]

[ci skip-build-wheels]
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
@benjyw benjyw changed the title Make local dists available in test, run and repl. Depending on python_distribution now builds and installs it for test, run, and repl Sep 16, 2021
@benjyw benjyw changed the title Depending on python_distribution now builds and installs it for test, run, and repl Depending on python_distribution builds and installs it for test, run, and repl Sep 16, 2021
@benjyw benjyw changed the title Depending on python_distribution builds and installs it for test, run, and repl Build and install python_distribution deps for test, run, and repl Sep 16, 2021
@benjyw benjyw merged commit 8519ef6 into pantsbuild:main Sep 16, 2021
@benjyw benjyw deleted the local_dists branch September 16, 2021 21:59
[
pex.digest,
local_dists.pex.digest,
local_dists.remaining_sources.source_files.snapshot.digest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the remaining sources added here? If I depend on a python_distribution to get its wheel as a dependency, the python_distribution target itself may contain many more files than those that actually make it into the wheel. I don't want those files, just the wheel; that's why I picked depending on the python_distribution and not a python_sources target it depends on - say.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has changed a bunch since this PR, but IIRC the issue is something like:

python_library_A explicitly depends on python_distribution_B and also imports module C from python_library_C which happens to be in the transitive dep closure of B, but is not published in B's wheel.

Rigorously, I suppose we could assume that C is published in some other distribution (or else how could the wheel of B consume it?) But I'm not sure that rigor holds up in reality. Maybe B depends on C via a spurious explicit dep, or the specific file from C imported by A isn't published in C's wheel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you describe here @benjyw might be related to #17594

Copy link
Contributor

@jsirois jsirois Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaos I think that just addresses 3rdparty transitive deps, not 1st party. The case @benjyw sketched with A, B, C is 1st party I believe. Unless I'm missing something, we have the information on disk to deal with Benjy's case. It might be a good deal of work to implement and it might be expensive in runtime, but its also opt-in and would give correct results. Right now depending on local dists has had a history of expediency that has solved immediate needs but consistent not been comprehensive; so we've continued to suffer paper-cuts from our perspective as devs running support but likely more serious wounds to the users encountering them.

Thanks for the case Benjy, that actually sounds likely now that you spell it out. I'll file an issue tracking the problem the current inclusion of extra files from LocalDist causes and then deciding whether to continue with expedient fixes (if there are any) vs. a comprehensive fix up attack can be made there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed #18254 to track the bug including the extra files can lead to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants