High total runtime for test, even with very high cache rate #12

Closed
stuhood opened this issue Feb 8, 2022 · 16 comments

stuhood commented Feb 8, 2022

For larger numbers of tests, the startup time and overhead of Pants in some private repositories can be very high. At the same time, memory usage is unexpectedly high (as first reported in pantsbuild/pants#12662).

The memory usage and total runtime are most likely related, so pursuing the metrics-gathering exercise from pantsbuild/pants#12662 would also be valuable for tracking down this performance issue.

stuhood self-assigned this Feb 18, 2022

stuhood commented Feb 21, 2022

At least for the cases I have profiles for, it looks like a very large fraction of the runtime is in: https://github.com/pantsbuild/pants/blob/e45b00d2eb467b599966bca262405a5d74d27bdd/src/python/pants/backend/python/util_rules/local_dists.py#L80-L83

Upon inspection, it looks like those @rules are pulling the entire contents of distributions into memory, and that they're doing so per-consumer (rather than per-dist). And because the work accomplished by those @rules is not executed as an external process, it is not persistently cached in the local/remote caches.
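
For context, a minimal sketch of the pattern being described (illustrative only, not the actual local_dists.py code: the helper name and the wheel filtering are made up). It materializes the full byte contents of every dist via DigestContents from within per-consumer @rule code, so the bytes are held in memory once per consumer and the work never reaches the persistent process-level caches:

```python
# Illustrative sketch only -- not the actual local_dists.py code; the helper
# name and wheel filtering are hypothetical. DigestContents materializes every
# byte of every dist into memory, and because this runs in @rule code (not as
# an external process), the result is only memoized in-memory per consumer
# rather than persisted to the local/remote caches.
from __future__ import annotations

from pants.engine.fs import Digest, DigestContents
from pants.engine.rules import Get


async def wheel_paths_for_consumer(dists_digest: Digest) -> tuple[str, ...]:
    # Pulls the full contents of the digest into memory...
    contents = await Get(DigestContents, Digest, dists_digest)
    # ...even though only the file paths are needed by the caller.
    return tuple(fc.path for fc in contents if fc.path.endswith(".whl"))
```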


stuhood commented Feb 21, 2022

pantsbuild/pants#14551 addresses the local dist performance issue.


stuhood commented Feb 22, 2022

Cherry-picking to 2.10.x in pantsbuild/pants#14555.


stuhood commented Feb 22, 2022

Have also opened pantsbuild/pants#14564, since it looks like in some cases there are no dists present at all.


benjyw commented Feb 23, 2022

I did some benchmarking on digest subsetting (see pantsbuild/pants#14569), and noticed the following:

  • Generating the "full subset" (by passing all of the paths in the digest in as the requested subset) takes ~1 second for 40k files, and this time grows linearly with the number of files in the digest.
  • Creating a digest from file content in memory (via CreateDigest) takes ~8.3 seconds for 40k files, and this time grows roughly quadratically with the number of files in the digest.

So it is possible that full-subset generation times of a few hundred ms add up across many consumers. But this benchmarking effort incidentally reveals something worse, not directly related to subsetting: superlinear behavior in CreateDigest.
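
For reference, here is a rough sketch of the shape of the two operations being measured (hedged: the actual benchmark lives in pantsbuild/pants#14569; this helper, its name, and the file contents are illustrative, and the Gets would need to be awaited from within a @rule):

```python
# Rough sketch of the two operations being benchmarked; illustrative only.
from __future__ import annotations

from pants.engine.fs import CreateDigest, Digest, DigestSubset, FileContent, PathGlobs
from pants.engine.rules import Get


async def benchmark_shapes(num_files: int) -> None:
    files = [FileContent(f"src/f_{i}.py", b"x = 1\n") for i in range(num_files)]

    # Creating a digest from in-memory file content: ~8.3s at 40k files,
    # growing roughly quadratically with the number of files.
    digest = await Get(Digest, CreateDigest(files))

    # The "full subset": requesting back every path already in the digest.
    # ~1s at 40k files, growing linearly.
    await Get(Digest, DigestSubset(digest, PathGlobs([fc.path for fc in files])))
```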


benjyw commented Feb 23, 2022

[Screenshot attached: Screen Shot 2022-02-23 at 1.12.09 AM]

benjyw commented Feb 23, 2022

Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.


stuhood commented Feb 23, 2022

Thanks a lot!

> Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.

Yeah, I think CreateDigest is implemented fairly naively: it creates an independent Digest per file and then merges all of them. I'll watch out for other consumers in the profiles, but I don't think the initial cases I saw used CreateDigest.
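
As a self-contained illustration of why that pattern can go quadratic (a toy model in plain Python, not the engine's actual Rust implementation): each pairwise merge re-copies the accumulated entries, so a left fold of per-file merges does O(n²) total work, while building from the full batch in one pass does not.

```python
# Toy model of the quadratic pattern: a left fold of pairwise merges, where
# each merge re-copies the accumulated entries, vs. building the tree in one pass.
from __future__ import annotations

import time


def merge(left: list[str], right: list[str]) -> list[str]:
    # Stand-in for merging two digest trees: cost proportional to their size.
    return sorted(left + right)


def create_digest_naive(paths: list[str]) -> list[str]:
    merged: list[str] = []
    for path in paths:
        merged = merge(merged, [path])  # step i re-copies ~i entries => O(n^2) total
    return merged


def create_digest_batched(paths: list[str]) -> list[str]:
    return sorted(paths)  # one pass over the full batch


for n in (4_000, 8_000, 16_000):
    paths = [f"src/file_{i}.py" for i in range(n)]
    t0 = time.perf_counter()
    create_digest_naive(paths)
    t1 = time.perf_counter()
    create_digest_batched(paths)
    t2 = time.perf_counter()
    print(f"n={n:>6}: naive={t1 - t0:.2f}s  batched={t2 - t1:.4f}s")
```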


stuhood commented Feb 27, 2022

I've opened pantsbuild/pants#14638 to help address pantsbuild/pants#12662. There appears to be only a small amount of overlap between memory usage and performance (less than I suspected when I opened this ticket), so we will likely break pantsbuild/pants#12662 out into a separate workstream while continuing to work on the performance issue captured here.


stuhood commented Feb 28, 2022

The next highest item in the profile was related to the snapshotting of sources (call counts in parentheses, aggregate durations in microseconds below each entry):

pants.backend.python.goals.pytest_runner.run_python_test ->
pants.backend.python.goals.pytest_runner.setup_pytest_for_target (360) ->
pants.backend.python.util_rules.python_sources.prepare_python_sources (360) ->
    56,868,568,602
pants.core.util_rules.source_files.determine_source_files (360) ->
    40,450,644,707
pants.engine.internals.graph.hydrate_sources (7722) ->
    274,146,863,500
snapshot (4649) ->
    106,989,925,260
digest_file (5253)
    64,643,182,013

After investigating whether it would be useful to implement batching (I don't think so, because inference will still need to capture or isolate per-target source batches, which might mean two different APIs), I've begun work on pantsbuild/pants#13112, which should significantly reduce IO.


stuhood commented Mar 4, 2022

There has been further progress on pantsbuild/pants#13112: see pantsbuild/pants#13112 (comment) for a summary. The full stack of patches should be landable early next week, although some further optimization might be necessary to show the full potential.


stuhood commented Mar 10, 2022

Most of the effort for pantsbuild/pants#13112 has now been completed, and preliminary microbenchmarking shows that the particular codepath from #12 (comment) (which represented 33-66% of the total runtime) has been reduced by 56%.

So we should see anywhere between an 18% (≈33% × 56%) and a 37% (≈66% × 56%) speedup in the production use case in 2.11.x. Once we have new production profiles we can resume work here, but for now I'm going to put it on hold.

stuhood removed their assignment Mar 10, 2022

stuhood commented Mar 14, 2022

Users were able to test with pantsbuild/pants#13112, and confirmed a 30% speedup, which is great.

Source hydration / snapshotting is still the largest part of the profile (the codepath below pants.backend.python.util_rules.python_sources.prepare_python_sources is still pretty hot), but no longer by such a significant margin. I'm asking for a more detailed profile (with --streaming-workunits-level=trace).

The next item in the list below source hydration is time spent fetching from the (remote) cache. There is a known performance issue in that area: we currently eagerly fetch all cache outputs before declaring a cache hit successful. That is because we can't know whether we'll need the outputs later, and we don't implement "backtracking" to re-run a process which we thought had hit but which actually missed -- covered by pantsbuild/pants#11331. It's not an issue of bandwidth (the total amount fetched from the cache in these cases is only ~2 MB), but rather one of latency and bounded concurrency (controlled by --remote-store-rpc-concurrency).
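
To illustrate the latency-and-concurrency point with a toy model (illustrative only: the real client is Rust, and the latency, concurrency, and blob-count numbers below are made up), N small blob fetches behind a concurrency limit of C with a round-trip latency of L take roughly ceil(N / C) × L, no matter how few bytes are transferred:

```python
# Toy model: latency-bound fetching under a concurrency limit. Even with tiny
# payloads, total time is roughly ceil(N / C) * RTT.
import asyncio
import math
import time

RTT = 0.05          # hypothetical 50 ms round trip
CONCURRENCY = 128   # stand-in for --remote-store-rpc-concurrency
NUM_BLOBS = 3_200   # roughly the number of cache lookups seen in the profiles


async def fetch_blob(sem: asyncio.Semaphore) -> bytes:
    async with sem:
        await asyncio.sleep(RTT)  # latency dominates; the payload would be tiny
        return b"tiny"


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    await asyncio.gather(*(fetch_blob(sem) for _ in range(NUM_BLOBS)))
    elapsed = time.perf_counter() - start
    predicted = math.ceil(NUM_BLOBS / CONCURRENCY) * RTT
    print(f"measured {elapsed:.2f}s vs predicted {predicted:.2f}s")


asyncio.run(main())
```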


stuhood commented Mar 17, 2022

New trace-level profiles show that pants.backend.python.util_rules.python_sources.prepare_python_sources has been knocked out of first place by dependency inference: in particular, by the time taken to either execute the inference processes or hit the cache for them.

The hottest path below setup_pytest_for_target looks like:

[
    {
        "method": "pants.backend.python.goals.pytest_runner.setup_pytest_for_target",
        "count": 398,
        "duration_micros": 56748332960
    },
    {
        "method": "pants.engine.internals.graph.transitive_targets",
        "count": 398,
        "duration_micros": 37749039905
    },
    {
        "method": "pants.engine.internals.graph.transitive_dependency_mapping",
        "count": 398,
        "duration_micros": 37737757996
    },
    {
        "method": "pants.engine.internals.graph.resolve_targets",
        "count": 4585,
        "duration_micros": 46571189900
    },
    {
        "method": "pants.engine.internals.graph.resolve_unexpanded_targets",
        "count": 5009,
        "duration_micros": 46579084763
    },
    {
        "method": "pants.engine.internals.graph.resolve_dependencies",
        "count": 4585,
        "duration_micros": 46556427119
    },
    {
        "method": "pants.backend.python.dependency_inference.rules.infer_python_dependencies_via_imports",
        "count": 3216,
        "duration_micros": 44731238539
    },
    {
        "method": "pants.backend.python.dependency_inference.parse_python_imports.parse_python_imports",
        "count": 3216,
        "duration_micros": 44514117183
    },
    {
        "method": "pants.engine.process.fallible_to_exec_result_or_raise",
        "count": 3216,
        "duration_micros": 28124661088
    },
    {
        "method": "process",
        "count": 3224,
        "duration_micros": 23976103909
    }
]

The dropoff between parse_python_imports and fallible_to_exec_result_or_raise is the time spent actually capturing the sources: the time spent in process is primarily the time spent accessing and then hitting the cache(s).

Surprisingly, the breakdown of time within process execution seems to be largely bound on local IO. The direct children of process look like:

[
    {
        "method": "remote_cache_read_speculation",
        "count": 3590,
        "duration_micros": 9230203432
    },
    {
        "method": "local_cache_read",
        "count": 3626,
        "duration_micros": 7594634338
    },
    {
        "method": "local_cache_write",
        "count": 3590,
        "duration_micros": 7303851431
    },
    {
        "method": "remote_cache_write",
        "count": 18,
        "duration_micros": 12707169
    },
    {
        "method": "remote_cache_write_setup",
        "count": 18,
        "duration_micros": 42
    }
]

Although the local cache read/write are not the hottest part of process execution, they are notable because we execute them first and last, synchronously, before doing remote cache lookups (whereas remote cache lookups occur in parallel with executing the process).

So, as a next step, it looks like we should try disabling the local cache entirely for this user, since reading from and writing to it looks like a bottleneck. If that improves things, we might want to make local cache lookups concurrent with remote cache lookups (sketched below), and/or skip local writes when we've already hit remotely.
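
A minimal sketch of the concurrent-lookup idea (illustrative Python only; the engine's scheduler is Rust, and the cache functions, timings, and return values here are hypothetical stand-ins): issue the local and remote cache reads at the same time and take the first hit, instead of waiting on the local read before starting the remote one.

```python
# Minimal sketch of racing local and remote cache lookups; all names and
# timings are hypothetical stand-ins.
import asyncio
from typing import Optional


async def local_cache_read(key: str) -> Optional[dict]:
    await asyncio.sleep(0.002)  # stand-in for local store latency
    return None                 # pretend: local miss


async def remote_cache_read(key: str) -> Optional[dict]:
    await asyncio.sleep(0.050)  # stand-in for network round trip
    return {"exit_code": 0}     # pretend: remote hit


async def lookup(key: str) -> Optional[dict]:
    local = asyncio.create_task(local_cache_read(key))
    remote = asyncio.create_task(remote_cache_read(key))
    result = None
    # Take the first lookup that reports a hit; if both miss, the caller
    # would fall back to actually running the process (omitted here).
    for finished in asyncio.as_completed([local, remote]):
        result = await finished
        if result is not None:
            break
    for task in (local, remote):
        task.cancel()  # drop whichever lookup is still in flight
    return result


print(asyncio.run(lookup("process-fingerprint")))
```

If both lookups miss, the real flow would fall back to running the process, and per the suggestion above could also skip the local write when the remote read has already hit.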


stuhood commented Apr 4, 2022

We've made significant progress on the performance of this case (from 10-12 minutes to 3-4 minutes), so although it is definitely not as low as we'd like it to be, I'm going to call this one resolved.

Once the dust settles on the 2.11.x release, and people are stably running with the new settings (in particular, with PEX-native lockfiles and the fix for pantsbuild/pants#14127 in place), we can revisit.

Please continue to open issues for the 2022 Performance Push. Thanks!

stuhood closed this as completed Apr 4, 2022