Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate local distribution contents once per distribution #14551

Merged
merged 3 commits into from
Feb 22, 2022

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Feb 21, 2022

Currently, local distribution wheel contents are computed once per consumer, rather than once per distribution. Additionally, since the calculation of provided files is using DigestContents, it is briefly pulling the entire contents of wheels into memory. For small files, this might be fine: but larger dists can use a lot of memory, particularly in the presence of concurrency.

This change moves per-distribution calculations into a separate @rule to allow for reuse across multiple consumers, and moves to computing wheel contents using an external process to allow it to be cached run over run.

…nsumer.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
… extraction.

[ci skip-rust]
[ci skip-build-wheels]
@stuhood
Copy link
Member Author

stuhood commented Feb 21, 2022

Commits are useful to review independently.

Copy link
Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, nice one!

And finger wag at myself for writing the janky code to begin with.

@benjyw
Copy link
Contributor

benjyw commented Feb 22, 2022

So is the reason the original rule wasn't satisfied from memoization/cache that the LocalDistsPexRequest.sources field is generally different for each consumer?

class LocalDistWheels:
"""Contains the wheels isolated from a single local Python distribution."""

wheel_paths: list[str]
Copy link
Contributor

@jsirois jsirois Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be true that wheel_paths are populated from a Snapshot below which happens to be sorted and it may be true that a set will have a stable order when formed multiple times in the same pantsd Python process - but both not encoding stable order and not encoding immutability in a "frozen" dataclass seems sub-optimal for sanity sake at the least.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea... I'll do this for consistency's sake.

But given how tightly we've adhered to functional patterns throughout the system, I do sometimes wonder whether defensively freezing dataclasses is worth it. It's clearly necessary when something will be used in a Get (for hash), but not necessary when something is only the return value of a @rule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue if that's true its too subtle. If a rule returns a thing that thing ~must be used by another rule as an input and so you'd expect it better be immutable.

Put it another way, if I have to debug "strange" behavior in the engine, a 1st stop would be to look for violations of patterns in the Python rules. Being able to sanely rely on ~types would be a good 1st stop in that 1st stop. Seeing frozen != frozen would give me pause here without a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it another way, if I have to debug "strange" behavior in the engine, a 1st stop would be to look for violations of patterns in the Python rules. Being able to sanely rely on ~types would be a good 1st stop in that 1st stop. Seeing frozen != frozen would give me pause here without a comment.

Sure. But we have had zero cases of those types of errors in the last few years... and whether that is due more to our defensive freezing of values, or due to mutating of a @rule-returned value just not being something that is likely/useful is the question that I'm referring to here.

@stuhood
Copy link
Member Author

stuhood commented Feb 22, 2022

So is the reason the original rule wasn't satisfied from memoization/cache that the LocalDistsPexRequest.sources field is generally different for each consumer?

Correct: that, and the Sources field has an Address in it. The build_local_dists rule still needs to run for every consumer (as it stands), since it's comparing the transitive deps sources to the consumed wheels.

[ci skip-rust]
[ci skip-build-wheels]
@stuhood stuhood enabled auto-merge (squash) February 22, 2022 18:05
@stuhood stuhood merged commit 40934ce into pantsbuild:main Feb 22, 2022
@stuhood stuhood deleted the stuhood/local-dists-perf branch February 22, 2022 18:53
@stuhood
Copy link
Member Author

stuhood commented Feb 22, 2022

@benjyw , @Eric-Arellano , @jsirois : Another thing to think about: rather than computing the transitive source dependencies, and then subtracting the set of transitive wheel dependencies, it would probably be more efficient to collect the transitive dependencies once in a distribution/wheel-vs-sources-aware way as we visited each node. That would be similar to the pattern behind the JVM recursively compiling CoarsenedTarget instances, and similar to what I think we might want to do for mypy.

EDIT: Opened #14561 about this.

stuhood added a commit to stuhood/pants that referenced this pull request Feb 22, 2022
…ld#14551)

Currently, local distribution wheel contents are computed once per consumer, rather than once per distribution. Additionally, since the calculation of provided files is using `DigestContents`, it is briefly pulling the entire contents of wheels into memory. For small files, this might be fine: but larger dists can use a lot of memory, particularly in the presence of concurrency.

This change moves per-distribution calculations into a separate `@rule` to allow for reuse across multiple consumers, and moves to computing wheel contents using an external process to allow it to be cached run over run.
# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Feb 22, 2022
…ck of #14551) (#14555)

Currently, local distribution wheel contents are computed once per consumer, rather than once per distribution. Additionally, since the calculation of provided files is using `DigestContents`, it is briefly pulling the entire contents of wheels into memory. For small files, this might be fine: but larger dists can use a lot of memory, particularly in the presence of concurrency.

This change moves per-distribution calculations into a separate `@rule` to allow for reuse across multiple consumers, and moves to computing wheel contents using an external process to allow it to be cached run over run.


[ci skip-rust]
[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Feb 23, 2022
#14551 improved the performance of local dist building when local distributions are actually present. But there are cases (which @benjyw is pursuing) where the `@rule` takes a long time to run, even when no dists are actually present. This is likely to do with the source subtraction: either the calculation of the subset paths, or the execution of `DigestSubset`.

[ci skip-rust]
[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this pull request Feb 23, 2022
…uild#14564)

pantsbuild#14551 improved the performance of local dist building when local distributions are actually present. But there are cases (which @benjyw is pursuing) where the `@rule` takes a long time to run, even when no dists are actually present. This is likely to do with the source subtraction: either the calculation of the subset paths, or the execution of `DigestSubset`.

[ci skip-rust]
[ci skip-build-wheels]
stuhood added a commit that referenced this pull request Feb 23, 2022
…pick of #14564) (#14566)

#14551 improved the performance of local dist building when local distributions are actually present. But there are cases (which @benjyw is pursuing) where the `@rule` takes a long time to run, even when no dists are actually present. This is likely to do with the source subtraction: either the calculation of the subset paths, or the execution of `DigestSubset`.

[ci skip-rust]
[ci skip-build-wheels]
alonsodomin pushed a commit to alonsodomin/pants that referenced this pull request Feb 25, 2022
…ld#14551)

Currently, local distribution wheel contents are computed once per consumer, rather than once per distribution. Additionally, since the calculation of provided files is using `DigestContents`, it is briefly pulling the entire contents of wheels into memory. For small files, this might be fine: but larger dists can use a lot of memory, particularly in the presence of concurrency.

This change moves per-distribution calculations into a separate `@rule` to allow for reuse across multiple consumers, and moves to computing wheel contents using an external process to allow it to be cached run over run.
alonsodomin pushed a commit to alonsodomin/pants that referenced this pull request Feb 25, 2022
…uild#14564)

pantsbuild#14551 improved the performance of local dist building when local distributions are actually present. But there are cases (which @benjyw is pursuing) where the `@rule` takes a long time to run, even when no dists are actually present. This is likely to do with the source subtraction: either the calculation of the subset paths, or the execution of `DigestSubset`.

[ci skip-rust]
[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants