Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pex_binary sandbox: Traversing dep graph, stopping after package targets #19155

Merged
merged 52 commits into from
Jul 15, 2023

Conversation

cognifloyd
Copy link
Member

@cognifloyd cognifloyd commented May 25, 2023

This builds on:

NB: This PR used to include the above PRs, but I split this PR up to make the pieces easier to review.

This PR only updates the pex_binary rule to use the new TraverseIfNotPackageTarget predicate when requesting the source files that should be included in this pex. Wheels, other pex_binaries, and other package targets are not handled via this code path since those are not python sources that get included in the pex based on the include_sources flag.

Fixes #15855

This fixes #15855 because anything in a python_distribution does not need to be included as a source file in the pex_binary. wheels get included via the LocalDists rules. In some cases, we might still get more sources in the pex than intended. This might happen if a dependency is inferred between sources, bypassing the logic that looks for wheels that own the relevant things. In any case, this is PR provides an improvement and resolves the sample error in #15855.

Related:

@huonw
Copy link
Contributor

huonw commented May 27, 2023

This might be related to #15855 (and may even fix it?).

@cognifloyd
Copy link
Member Author

As is, this won't fix any issues, but it does lay the ground work for such fixes. Hopefully those fixes will be as simple as replacing TransitiveTargetsRequest with TransitiveTargetsWithoutTraversingPackagesRequest.

I suspect additional code changes will be required however, because this changes an assumption that has been around for awhile. So, there are likely to be other bits of code that need to be adjusted.

Using this will be more complex for Python distributions, in particular, because of how dist "ownership" of sources is calculated. Hopefully that is not the case for pex binaries.

@cognifloyd
Copy link
Member Author

cognifloyd commented May 27, 2023

Ooh! It looks like the PEX binaries case will be just that simple!

transitive_targets = await Get(
TransitiveTargets, TransitiveTargetsRequest(request.addresses)
)
sources = await Get(PythonSourceFiles, PythonSourceFilesRequest(transitive_targets.closure))

complete_platforms=complete_platforms,
output_filename=output_filename,
layout=PexLayout(field_set.layout.value),
additional_args=field_set.generate_additional_args(pex_binary_defaults),
include_requirements=field_set.include_requirements.value,

@thejcannon
Copy link
Member

I think there's likely a handful of places this would be extremely useful, and we may already be doing something similar. archive and docker come to mind.

@cognifloyd
Copy link
Member Author

I have a fix for pex_binaries #15855. I pushed the test to show it fails without the fix, and then I'll push the commit that uses the new TransitiveTargetsWithoutTraversingPackagesRequest to resolve the issue. The failing test can be seen here:
https://github.com/pantsbuild/pants/actions/runs/5147041844/jobs/9266975422#step:11:346

I think there's likely a handful of places this would be extremely useful, and we may already be doing something similar. archive and docker come to mind.

archive has SpecialCasedDependencies fields instead of a dependencies field. It only accepts packages (which get generated), and files (ie any other source types have to use wrap_as_files). Maybe a future refactor could add a dependencies field that uses this TransitiveTargetsWithoutTraversingPackagesRequest to grab ALL sources as files, except for packages which get built. In any case, that would be a new feature which is beyond the scope of this PR.

docker_image generates packages for transitive targets, but only includes source files from direct dependencies. Refactoring that to get transitive sources as well would probably be an excellent improvement, but is more involved than the pex_binary fix, so I'm going to leave that for a different PR as well. This looks like the relevant rule for such a refactor:

async def create_docker_build_context(request: DockerBuildContextRequest) -> DockerBuildContext:
# Get all targets to include in context.
transitive_targets = await Get(TransitiveTargets, TransitiveTargetsRequest([request.address]))
docker_image = transitive_targets.roots[0]
# Get all dependencies for the root target.
root_dependencies = await Get(Targets, DependenciesRequest(docker_image.get(Dependencies)))
# Get all file sources from the root dependencies. That includes any non-file sources that can
# be "codegen"ed into a file source.
sources_request = Get(
SourceFiles,
SourceFilesRequest(
sources_fields=[tgt.get(SourcesField) for tgt in root_dependencies],
for_sources_types=(
DockerContextFilesSourcesField,
FileSourceField,
),
enable_codegen=True,
),
)

@@ -538,7 +539,7 @@ async def create_pex_from_targets(
sources_digests.append(request.additional_sources)
if request.include_source_files:
transitive_targets = await Get(
TransitiveTargets, TransitiveTargetsRequest(request.addresses)
TransitiveTargets, TransitiveTargetsWithoutTraversingPackagesRequest(request.addresses)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change required to fix #15855 (this line, and the rest of the logic behind it).

src/python/pants/engine/internals/graph.py Outdated Show resolved Hide resolved
src/python/pants/core/goals/package.py Outdated Show resolved Hide resolved
src/python/pants/engine/target.py Outdated Show resolved Hide resolved
@cognifloyd
Copy link
Member Author

I extracted #19306 from this as well. That PR adds the TraverseIfNotPackageTarget predicate, leaving the pex_binary bit here.

I might open a separate PR for the pex_binary stuff too because this PR is from a branch on my fork.

@cognifloyd
Copy link
Member Author

This predicate solution only helps for explicit deps because, as you pointed out, dep inference does not add packages to the dep graph.

For packages that can "own" particular sources--like python_distribution owns python_sources in the same dir and child dirs--the packaging rules for those types would have to do extra work to find those packages and subtract their "owned" sources. So, in addition to the TraverseIfNotPackageTarget, we might need to improve the ownership logic for the package types that need it.

@benjyw
Copy link
Contributor

benjyw commented Jun 16, 2023

I think what you describe is true, but probably not a common case. And the behavior you describe would be desirable.

The real world graph is much more than 3 nodes, and packages really shouldn't be traversed

What? The case I described is absolutely ubiquitous. AFAICT it exists in ~100% of the cases where some code depends on a first-party package.

@benjyw
Copy link
Contributor

benjyw commented Jun 16, 2023

This predicate solution only helps for explicit deps because, as you pointed out, dep inference does not add packages to the dep graph.

For packages that can "own" particular sources--like python_distribution owns python_sources in the same dir and child dirs--the packaging rules for those types would have to do extra work to find those packages and subtract their "owned" sources. So, in addition to the TraverseIfNotPackageTarget, we might need to improve the ownership logic for the package types that need it.

I feel like I'm misunderstanding something fundamental. When we traverse deps we follow inferred deps, right? And my point is that we will follow them in this case, bypassing the package target entirely. So we will still "see" library B as a "direct" dep of library A, which I thought is what we are trying to avoid (since we intend to pull that code in via the package)?

I think I am missing the use case this is intending to fix.

@thejcannon
Copy link
Member

Ah I see what you're saying now. This whole time I was thinking of runnable binaries (like pex_binary), but you're thinking python_distribution. OK.

@thejcannon
Copy link
Member

Although to be fair, doing things this way is better than what we have today, albeit not ideal. You'll net fewer dependencies (just not the minimum you might be expecting)

@cognifloyd
Copy link
Member Author

cognifloyd commented Jun 16, 2023

This predicate solution only helps for explicit deps because, as you pointed out, dep inference does not add packages to the dep graph.

For packages that can "own" particular sources--like python_distribution owns python_sources in the same dir and child dirs--the packaging rules for those types would have to do extra work to find those packages and subtract their "owned" sources. So, in addition to the TraverseIfNotPackageTarget, we might need to improve the ownership logic for the package types that need it.

I feel like I'm misunderstanding something fundamental. When we traverse deps we follow inferred deps, right?

Yes. We follow inferred deps.

And my point is that we will follow them in this case, bypassing the package target entirely. So we will still "see" library B as a "direct" dep of library A, which I thought is what we are trying to avoid (since we intend to pull that code in via the package)?

Yes and no. Yes we will still "see" library B as a "direct" dep of library A. But we will not get there VIA a package target.

My changes cannot replace the python_distribution logic because the "ownership" is NOT based solely on the dependency graph. My changes are not designed to help the python_distribution use-case. My changes are designed for targets that do not get inferred dependencies; I'm focused on targets that only have explicit dependencies because there are no good ways to infer those deps.

I think I am missing the use case this is intending to fix.

Adding this deps traversal feature turned out to be a very easy win for pex_binary to improve things at least a bit. But, this was an after thought and a happy bonus.

Ok, here's the rabbit hole on my use case.


My key use-case is the nFPM backend which creates system packages (rpm, deb, archlinux, and apk/Alpine). Packaging an nfpm_*_package target requires not only which files to install, it needs to know about ownership, permissions, and install location for all files in the package. Trying to shove that into the package target was unwieldy, so I am adding nfpm_content_file(s) targets. If a group of files will be installed in the same directory with the same permissions, use the nfpm_content_files target generator. Otherwise, use the atomic nfpm_content_file target. This target will depend on the target that actually provides the file, as well as any other nfpm_content_* that that given file will need installed with it. Both pants and pants users need to be able to understand how the dependency graph translates into packaged files. So, I want to simplify by adding restrictions to how I traverse the dependency graph. The nFPM backend will request the transitive targets twice: once to generate the list of files (with permissions, etc) to pass to nFPM defining what will be included in the package; and a second pass to build the sandbox from which nFPM will pluck the files it puts in the package. When constructing the sandbox:

  • a key assumption is that inference doesn't help here - anything in the package has to have an explicit target that defines how it is included and installed.
  • a dependency on another nfpm_*_package means there's a package relationship. That package, however, will not be included in the sandbox because it does not represent a file that can be included in the nFPM package. Any dependency of the other nFPM package(s) should also NOT be included or processing the deps becomes confusing and messy.
  • a dependency on a generic target that depends on nfpm_content_* would facilitate sharing dep lists across multiple packages targets (an rpm, and a deb for example).
  • a dependency on nfpm_content_* which can also depend on other nfpm_content _* means we can't just take the direct deps, we want the transitive deps.
  • each nfpm_content_file depends on one other package or source file that should be added up the sandbox. If a source file has deps that pull in other files, they will land in the sandbox, but nFPM will ignore them unless a matching nfpm_content_file(s) describes where/how they should be installed.
  • Since each file has to be registered with nFPM, it is not helpful to traverse the deps of packages. It is simpler and less surprising to say that the deps of the package are assumed to be build-time deps, and so they are not included in the sandbox used to package that binary. Any runtime file deps would have to have the matching nfpm_content_file(s) targets.
  • since deps of a package are assumed to be build time deps, we also assume that those build time deps do not include any runtime/install time deps on nfpm_* targets. So, traversing any package deps is wasteful (copy more than needed), confusing, and it makes it harder to reason about what will be packaged.

So, I wanted a list of transitive deps from a dep tree walk that does not traverse any package targets. Then I can sort those targets into lists of nFPM packages, other packagable targets, nfpm_content targets, and all other source files. Then I request packages to put in the sandbox with the remaining source files.

@cognifloyd
Copy link
Member Author

(I made this diagram to help me think through the different pieces of the nfpm dependency tree: https://whimsical.com/pants-deps-U25HpngLy7qD1SDCoLXmdT )

@benjyw
Copy link
Contributor

benjyw commented Jun 16, 2023

Got it, thank you! That is exactly what I was missing. My thinking was too Python-centric.

OK, so this makes sense to me now as a feature. And I think the implementation overall makes sense.

I can review when this is out of draft.

jriddy pushed a commit that referenced this pull request Jun 17, 2023
…plugins (#19306)

This builds on #19272, adding another `should_traverse_deps_predicate`
that stops dependency traversal at any package targets.

This is mostly extracted from #19155.

`TraverseIfNotPackageTarget` will be useful whenever a
`TransitiveTargetsRequest`, `CoarsenedTargetsRequest`, or
`DependenciesRequest` could benefit from treating package targets as
leaves. This PR does not change any `TransitiveTargetsRequest`s because
that is probably a change in user-facing behavior (even if it counts as
a bugfix) and needs to be documented as such. This PR merely adds the
feature, which, on its own, does not impact anything else.

Related:
- #18254
- #17368
- #15855
- #15082

---------

Co-authored-by: Andreas Stenius <andreas.stenius@imanage.com>
@cognifloyd cognifloyd changed the title Traversing dep graph, stopping at package targets pex_binary sandbox: Traversing dep graph, stopping after package targets Jun 27, 2023
@cognifloyd cognifloyd marked this pull request as ready for review June 27, 2023 21:38
@cognifloyd
Copy link
Member Author

This PR only covers the pex_binary bug fix now. I have merged in main and adjusted the PR description to explain what is included.

@cognifloyd cognifloyd added category:bugfix Bug fixes for released features and removed category:plugin api change labels Jun 27, 2023
@cognifloyd cognifloyd removed the request for review from Eric-Arellano June 28, 2023 19:38
@cognifloyd
Copy link
Member Author

@benjyw @stuhood @thejcannon this is out of draft now. :)

Copy link
Member

@thejcannon thejcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that PR is tiny, and does fix the linked bug I think it's good to go.

I'd love to see this evolve further to continue aligning with user expectations, but we can do that incrementally.

Thanks again for all your hard work here.

(If you don't mind, I'd feel good if one other maintainer approved as well 😉 )

roots=request.addresses,
union_membership=union_membership,
),
),
)
sources = await Get(PythonSourceFiles, PythonSourceFilesRequest(transitive_targets.closure))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also do the right thing and try and package the package targets for inclusion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is a philosophical question and likely shouldn't be addressed in this PR)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it's not sources but local dists? which is handled further down, iiuc..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's even weirder I guess. I'm thinking of other packageables, like a pex_binary or an archive or some plugin-provided packageable thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you saying to embed other PEXes/archives/etc... inside PEXes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure why not? If someone has declared that dependency (between a Python source and a pex_binary) the intent is pretty clear to me... that code depends on that PEX.

roots=request.addresses,
union_membership=union_membership,
),
),
)
sources = await Get(PythonSourceFiles, PythonSourceFilesRequest(transitive_targets.closure))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then it's not sources but local dists? which is handled further down, iiuc..

@cognifloyd cognifloyd merged commit 5ee6395 into pantsbuild:main Jul 15, 2023
@cognifloyd cognifloyd deleted the transitives-and-packages branch July 15, 2023 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category:bugfix Bug fixes for released features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pex_binary depending on python_distribution is consuming transitive deps
6 participants