Calculate local distribution contents once per distribution #14551

stuhood · 2022-02-21T23:12:58Z

Currently, local distribution wheel contents are computed once per consumer, rather than once per distribution. Additionally, since the calculation of provided files uses `DigestContents`, it briefly pulls the entire contents of wheels into memory. For small files this might be fine, but larger dists can use a lot of memory, particularly in the presence of concurrency.

This change moves per-distribution calculations into a separate `@rule` to allow for reuse across multiple consumers, and moves to computing wheel contents using an external process so that the result can be cached run over run.

stuhood · 2022-02-21T23:13:09Z

Commits are useful to review independently.

benjyw

Phew, nice one!

And finger wag at myself for writing the janky code to begin with.

benjyw · 2022-02-22T00:55:50Z

So is the reason the original rule wasn't satisfied from memoization/cache that the LocalDistsPexRequest.sources field is generally different for each consumer?

jsirois · 2022-02-22T01:53:08Z

src/python/pants/backend/python/util_rules/local_dists.py

+class LocalDistWheels:
+    """Contains the wheels isolated from a single local Python distribution."""
+
+    wheel_paths: list[str]


It may be true that wheel_paths are populated from a Snapshot below, which happens to be sorted, and it may be true that a set will have a stable order when formed multiple times in the same pantsd Python process. But encoding neither stable order nor immutability (via a "frozen" dataclass) seems sub-optimal, for sanity's sake at the least.

Yea... I'll do this for consistency's sake.

But given how tightly we've adhered to functional patterns throughout the system, I do sometimes wonder whether defensively freezing dataclasses is worth it. It's clearly necessary when something will be used in a Get (for hash), but not necessary when something is only the return value of a @rule.

I'd argue that if that's true, it's too subtle. If a rule returns a thing, that thing ~must be used by another rule as an input, so you'd expect that it had better be immutable.

Put it another way, if I have to debug "strange" behavior in the engine, a 1st stop would be to look for violations of patterns in the Python rules. Being able to sanely rely on ~types would be a good 1st stop in that 1st stop. Seeing frozen != frozen would give me pause here without a comment.

> Put it another way, if I have to debug "strange" behavior in the engine, a 1st stop would be to look for violations of patterns in the Python rules. Being able to sanely rely on ~types would be a good 1st stop in that 1st stop. Seeing frozen != frozen would give me pause here without a comment.

Sure. But we have had zero cases of those types of errors in the last few years... and whether that is due more to our defensive freezing of values, or due to mutating of a @rule-returned value just not being something that is likely/useful is the question that I'm referring to here.
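
For readers following the thread, the frozen, order-stable variant under discussion might look roughly like the sketch below. This is an illustration rather than the merged code: the `create` classmethod is a hypothetical helper, and switching `wheel_paths` from a list to a sorted tuple is one assumed way of encoding both immutability and stable order.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from typing import Iterable


@dataclass(frozen=True)
class LocalDistWheels:
    """Contains the wheels isolated from a single local Python distribution."""

    # A tuple (not a list) keeps the value hashable and immutable.
    wheel_paths: tuple[str, ...]

    @classmethod
    def create(cls, paths: Iterable[str]) -> LocalDistWheels:
        # Hypothetical helper: dedupe and sort so that equal inputs always
        # produce equal values, whatever order they arrived in.
        return cls(tuple(sorted(set(paths))))


if __name__ == "__main__":
    a = LocalDistWheels.create(["b.whl", "a.whl"])
    b = LocalDistWheels.create({"a.whl", "b.whl"})
    print(a == b, hash(a) == hash(b))
    try:
        a.wheel_paths = ()  # type: ignore[misc]
    except FrozenInstanceError:
        print("mutation rejected")
```

Freezing makes the dataclass generate `__hash__` from its fields, which is what allows a value like this to be used as an input to another rule.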

stuhood · 2022-02-22T17:42:24Z

> So is the reason the original rule wasn't satisfied from memoization/cache that the LocalDistsPexRequest.sources field is generally different for each consumer?

Correct: that, and the Sources field has an Address in it. The build_local_dists rule still needs to run for every consumer (as it stands), since it's comparing the transitive deps' sources to the consumed wheels.
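
The cache-key effect described in this exchange can be shown with a toy model. In the sketch below, `functools.lru_cache` stands in for the engine's memoization (an assumption made purely for illustration), and all function and file names are hypothetical. When the key includes consumer-specific sources, every consumer misses; when the key is the distribution alone, the work happens once.

```python
from __future__ import annotations

from functools import lru_cache

# Counts how often each (pretend) expensive computation actually runs.
CALLS = {"per_consumer": 0, "per_dist": 0}


@lru_cache(maxsize=None)
def contents_per_consumer(dist: str, consumer_sources: frozenset[str]) -> tuple[str, ...]:
    # The key includes the consumer's sources, so each consumer is a new entry.
    CALLS["per_consumer"] += 1
    return (f"{dist}/__init__.py",)


@lru_cache(maxsize=None)
def contents_per_dist(dist: str) -> tuple[str, ...]:
    # The key is only the distribution, so all consumers share one entry.
    CALLS["per_dist"] += 1
    return (f"{dist}/__init__.py",)


consumers = [
    frozenset({"a_test.py"}),
    frozenset({"b_test.py"}),
    frozenset({"c_test.py"}),
]
for sources in consumers:
    contents_per_consumer("mydist", sources)
    contents_per_dist("mydist")

if __name__ == "__main__":
    print(CALLS)
```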

stuhood · 2022-02-22T18:57:37Z

@benjyw, @Eric-Arellano, @jsirois: Another thing to think about: rather than computing the transitive source dependencies and then subtracting the set of transitive wheel dependencies, it would probably be more efficient to collect the transitive dependencies once, in a distribution/wheel-vs-sources-aware way, as we visit each node. That would be similar to the pattern behind the JVM recursively compiling CoarsenedTarget instances, and similar to what I think we might want to do for mypy.

EDIT: Opened #14561 about this.
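
One way to picture the single-pass idea is the sketch below. It is speculative and not taken from #14561: the graph is a plain dict, `dist_owner` is a hypothetical mapping from a target to the distribution that provides it, and the choice to stop descending at an owned target is an assumption about how "wheel-vs-sources-aware" might be interpreted.

```python
from __future__ import annotations


def partition_transitive_deps(
    roots: list[str],
    deps: dict[str, list[str]],
    dist_owner: dict[str, str],
) -> tuple[set[str], set[str]]:
    """Walk the graph once, returning (source targets, distributions)."""
    sources: set[str] = set()
    dists: set[str] = set()
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        owner = dist_owner.get(node)
        if owner is not None:
            # Assumption: an owned target is provided by its wheel, so record
            # the distribution and do not descend into its dependencies.
            dists.add(owner)
            continue
        sources.add(node)
        stack.extend(deps.get(node, []))
    return sources, dists


if __name__ == "__main__":
    graph = {"app": ["lib", "util"], "lib": ["lib_core"], "util": []}
    owners = {"lib": "libdist", "lib_core": "libdist"}
    print(partition_transitive_deps(["app"], graph, owners))
```

Compared with collecting everything and subtracting afterwards, a walk like this never materializes the sources that a wheel already provides.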

#14551 improved the performance of local dist building when local distributions are actually present. But there are cases (which @benjyw is pursuing) where the `@rule` takes a long time to run even when no dists are present. This likely has to do with the source subtraction: either the calculation of the subset paths, or the execution of `DigestSubset`.
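
The shape of such a short-circuit can be sketched with plain Python sets standing in for digests. This is only an illustration of the early return; `subtract_dist_sources` is a hypothetical name and the real rule operates on engine types rather than `frozenset`.

```python
from __future__ import annotations


def subtract_dist_sources(
    sources: frozenset[str], dist_files: frozenset[str]
) -> frozenset[str]:
    """Remove files provided by distributions from the consumer's sources."""
    if not dist_files:
        # No distributions: there is nothing to subtract, so skip the
        # (potentially expensive) subset computation and return the input.
        return sources
    return sources - dist_files


if __name__ == "__main__":
    srcs = frozenset({"app.py", "lib/core.py"})
    print(subtract_dist_sources(srcs, frozenset()) is srcs)
    print(sorted(subtract_dist_sources(srcs, frozenset({"lib/core.py"}))))
```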

stuhood added 2 commits February 21, 2022 14:39

Isolate wheel contents once per distribution, rather than once per consumer.

aab80e7

Compute file listings via an external process to allow for caching of extraction.

00c1aa1

stuhood requested review from jsirois, benjyw and Eric-Arellano February 21, 2022 23:12

stuhood mentioned this pull request Feb 21, 2022

High total runtime for test, even with very high cache rate toolchainlabs/issues#12

Closed

stuhood added the needs-cherrypick label Feb 21, 2022

stuhood added this to the 2.10.x milestone Feb 21, 2022

benjyw approved these changes Feb 22, 2022

jsirois approved these changes Feb 22, 2022

Eric-Arellano approved these changes Feb 22, 2022

Review feedback and test fix.

1f347c1

Eric-Arellano approved these changes Feb 22, 2022

stuhood enabled auto-merge (squash) February 22, 2022 18:05

stuhood merged commit 40934ce into pantsbuild:main Feb 22, 2022

stuhood deleted the stuhood/local-dists-perf branch February 22, 2022 18:53

stuhood mentioned this pull request Feb 22, 2022

Shortcircuit local distribution source subsetting if there are no dists. #14564

Merged

stuhood mentioned this pull request Feb 23, 2022

Shortcircuit source subsetting if there are no distributions. (cherrypick of #14564) #14566

Merged

stuhood removed the needs-cherrypick label Mar 3, 2022

benjyw mentioned this pull request Apr 2, 2022

Pex binary errors when depending on a distribution #14983

Open
