High total runtime for test, even with very high cache rate #12

Closed
stuhood opened this issue Feb 8, 2022 · 16 comments

stuhood commented Feb 8, 2022

For larger numbers of tests, the startup time and overhead of Pants in some private repositories can be very high. At the same time, memory usage is unexpectedly high (as first reported in pantsbuild/pants#12662).

The memory usage and total runtime are most likely related, so pursuing the metrics-gathering exercise from pantsbuild/pants#12662 would also be valuable for tracking down this performance issue.

stuhood self-assigned this Feb 18, 2022

stuhood commented Feb 21, 2022

At least for the cases I have profiles for, it looks like a very large fraction of the runtime is in: https://github.com/pantsbuild/pants/blob/e45b00d2eb467b599966bca262405a5d74d27bdd/src/python/pants/backend/python/util_rules/local_dists.py#L80-L83

Upon inspection, it looks like those @rules are pulling the entire contents of distributions into memory, and that they're doing so per-consumer (rather than per-dist). And because the work accomplished by those @rules is not executed as an external process, it is not persistently cached in the local/remote caches.
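
For context, a minimal sketch of the pattern being described (illustrative only, not the actual local_dists.py code: the helper name and the wheel filtering are made up). It materializes the full byte contents of every dist via DigestContents from within per-consumer @rule code, so the bytes are held in memory once per consumer and the work never reaches the persistent process-level caches:

```python
# Illustrative sketch only -- not the actual local_dists.py code; the helper
# name and wheel filtering are hypothetical. DigestContents materializes every
# byte of every dist into memory, and because this runs in @rule code (not as
# an external process), the result is only memoized in-memory per consumer
# rather than persisted to the local/remote caches.
from __future__ import annotations

from pants.engine.fs import Digest, DigestContents
from pants.engine.rules import Get


async def wheel_paths_for_consumer(dists_digest: Digest) -> tuple[str, ...]:
    # Pulls the full contents of the digest into memory...
    contents = await Get(DigestContents, Digest, dists_digest)
    # ...even though only the file paths are needed by the caller.
    return tuple(fc.path for fc in contents if fc.path.endswith(".whl"))
```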


stuhood commented Feb 21, 2022

pantsbuild/pants#14551 addresses the local dist performance issue.


stuhood commented Feb 22, 2022

Cherry-picking to 2.10.x in pantsbuild/pants#14555.


stuhood commented Feb 22, 2022

Have also opened pantsbuild/pants#14564, since it looks like in some cases there are no dists present at all.


benjyw commented Feb 23, 2022

I did some benchmarking on digest subsetting (see pantsbuild/pants#14569), and noticed the following:

  • Generating the "full subset" (by passing all of the paths in the digest in as the requested subset) takes ~1 second for 40k files, and this time grows linearly with the number of files in the digest.
  • Creating a digest from file content in memory (via CreateDigest) takes ~8.3 seconds for 40k files, and this time grows roughly quadratically with the number of files in the digest.

So it is possible that full-subset generation times of a few hundred ms add up across many consumers. But this benchmarking effort incidentally reveals something worse, not directly related to subsetting: superlinear behavior in CreateDigest.
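
For reference, here is a rough sketch of the shape of the two operations being measured (hedged: the actual benchmark lives in pantsbuild/pants#14569; this helper, its name, and the file contents are illustrative, and the Gets would need to be awaited from within a @rule):

```python
# Rough sketch of the two operations being benchmarked; illustrative only.
from __future__ import annotations

from pants.engine.fs import CreateDigest, Digest, DigestSubset, FileContent, PathGlobs
from pants.engine.rules import Get


async def benchmark_shapes(num_files: int) -> None:
    files = [FileContent(f"src/f_{i}.py", b"x = 1\n") for i in range(num_files)]

    # Creating a digest from in-memory file content: ~8.3s at 40k files,
    # growing roughly quadratically with the number of files.
    digest = await Get(Digest, CreateDigest(files))

    # The "full subset": requesting back every path already in the digest.
    # ~1s at 40k files, growing linearly.
    await Get(Digest, DigestSubset(digest, PathGlobs([fc.path for fc in files])))
```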


benjyw commented Feb 23, 2022

[Screenshot attached: Screen Shot 2022-02-23 at 1.12.09 AM]

benjyw commented Feb 23, 2022

Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.


stuhood commented Feb 23, 2022

Thanks a lot!

> Of course this may be a red herring for this specific issue, since I'm not sure we ever CreateDigest on significant numbers of files outside a test setting. Nonetheless it would be instructive to know why we have quadratic behavior there, as maybe it means we have it elsewhere.

Yeah, I think CreateDigest is implemented fairly naively: it creates an independent Digest per file and then merges all of them. I'll watch out for other consumers in the profiles, but I don't think the initial cases I saw used CreateDigest.
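
As a self-contained illustration of why that pattern can go quadratic (a toy model in plain Python, not the engine's actual Rust implementation): each pairwise merge re-copies the accumulated entries, so a left fold of per-file merges does O(n²) total work, while building from the full batch in one pass does not.

```python
# Toy model of the quadratic pattern: a left fold of pairwise merges, where
# each merge re-copies the accumulated entries, vs. building the tree in one pass.
from __future__ import annotations

import time


def merge(left: list[str], right: list[str]) -> list[str]:
    # Stand-in for merging two digest trees: cost proportional to their size.
    return sorted(left + right)


def create_digest_naive(paths: list[str]) -> list[str]:
    merged: list[str] = []
    for path in paths:
        merged = merge(merged, [path])  # step i re-copies ~i entries => O(n^2) total
    return merged


def create_digest_batched(paths: list[str]) -> list[str]:
    return sorted(paths)  # one pass over the full batch


for n in (4_000, 8_000, 16_000):
    paths = [f"src/file_{i}.py" for i in range(n)]
    t0 = time.perf_counter()
    create_digest_naive(paths)
    t1 = time.perf_counter()
    create_digest_batched(paths)
    t2 = time.perf_counter()
    print(f"n={n:>6}: naive={t1 - t0:.2f}s  batched={t2 - t1:.4f}s")
```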


stuhood commented Feb 27, 2022

I've opened pantsbuild/pants#14638 to help address pantsbuild/pants#12662. There appears to be only a small amount of overlap between memory usage and performance (less than I suspected when I opened this ticket), so we will likely break pantsbuild/pants#12662 out into a separate workstream while continuing to work on the performance issue captured here.


stuhood commented Feb 28, 2022

The next highest item in the profile was related to the snapshotting of sources (call counts in parentheses, aggregate durations in microseconds below each entry):

pants.backend.python.goals.pytest_runner.run_python_test ->
pants.backend.python.goals.pytest_runner.setup_pytest_for_target (360) ->
pants.backend.python.util_rules.python_sources.prepare_python_sources (360) ->
    56,868,568,602
pants.core.util_rules.source_files.determine_source_files (360) ->
    40,450,644,707
pants.engine.internals.graph.hydrate_sources (7722) ->
    274,146,863,500
snapshot (4649) ->
    106,989,925,260
digest_file (5253)
    64,643,182,013

After investigating whether it would be useful to implement batching (I don't think so, because inference will still need to capture or isolate per-target source batches, which might mean two different APIs), I've begun work on pantsbuild/pants#13112, which should significantly reduce IO.


stuhood commented Mar 4, 2022

There has been further progress on pantsbuild/pants#13112: see pantsbuild/pants#13112 (comment) for a summary. The full stack of patches should be landable early next week, although some further optimization might be necessary to show the full potential.


stuhood commented Mar 10, 2022

Most of the effort for pantsbuild/pants#13112 has now been completed, and preliminary microbenchmarking shows that the particular codepath from #12 (comment) (which represented 33-66% of the total runtime) has been reduced by 56%.

So we should see anywhere between an 18% (≈33% × 56%) and a 37% (≈66% × 56%) speedup in the production use case in 2.11.x. Once we have new production profiles we can resume work here, but for now I'm going to put it on hold.

stuhood removed their assignment Mar 10, 2022

stuhood commented Mar 14, 2022

Users were able to test with pantsbuild/pants#13112, and confirmed a 30% speedup, which is great.

Source hydration / snapshotting is still the largest part of the profile (the codepath below pants.backend.python.util_rules.python_sources.prepare_python_sources is still pretty hot), but no longer by such a significant margin. I'm asking for a more detailed profile (with --streaming-workunits-level=trace).

The next item in the list below source hydration is time spent fetching from the (remote) cache. There is a known performance issue in that area: we currently eagerly fetch all cache outputs before declaring a cache hit successful. That is because we can't know whether we'll need the outputs later, and we don't implement "backtracking" to re-run a process which we thought had hit but which actually missed -- covered by pantsbuild/pants#11331. It's not an issue of bandwidth (the total amount fetched from the cache in these cases is only ~2 MB), but rather one of latency and bounded concurrency (controlled by --remote-store-rpc-concurrency).
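
To illustrate the latency-and-concurrency point with a toy model (illustrative only: the real client is Rust, and the latency, concurrency, and blob-count numbers below are made up), N small blob fetches behind a concurrency limit of C with a round-trip latency of L take roughly ceil(N / C) × L, no matter how few bytes are transferred:

```python
# Toy model: latency-bound fetching under a concurrency limit. Even with tiny
# payloads, total time is roughly ceil(N / C) * RTT.
import asyncio
import math
import time

RTT = 0.05          # hypothetical 50 ms round trip
CONCURRENCY = 128   # stand-in for --remote-store-rpc-concurrency
NUM_BLOBS = 3_200   # roughly the number of cache lookups seen in the profiles


async def fetch_blob(sem: asyncio.Semaphore) -> bytes:
    async with sem:
        await asyncio.sleep(RTT)  # latency dominates; the payload would be tiny
        return b"tiny"


async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    start = time.perf_counter()
    await asyncio.gather(*(fetch_blob(sem) for _ in range(NUM_BLOBS)))
    elapsed = time.perf_counter() - start
    predicted = math.ceil(NUM_BLOBS / CONCURRENCY) * RTT
    print(f"measured {elapsed:.2f}s vs predicted {predicted:.2f}s")


asyncio.run(main())
```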


stuhood commented Mar 17, 2022

New trace-level profiles show that pants.backend.python.util_rules.python_sources.prepare_python_sources has been knocked out of first place by dependency inference: in particular, by the time taken to either execute the inference processes or hit the cache for them.

The hottest path below setup_pytest_for_target looks like:

[
    {
        "method": "pants.backend.python.goals.pytest_runner.setup_pytest_for_target",
        "count": 398,
        "duration_micros": 56748332960
    },
    {
        "method": "pants.engine.internals.graph.transitive_targets",
        "count": 398,
        "duration_micros": 37749039905
    },
    {
        "method": "pants.engine.internals.graph.transitive_dependency_mapping",
        "count": 398,
        "duration_micros": 37737757996
    },
    {
        "method": "pants.engine.internals.graph.resolve_targets",
        "count": 4585,
        "duration_micros": 46571189900
    },
    {
        "method": "pants.engine.internals.graph.resolve_unexpanded_targets",
        "count": 5009,
        "duration_micros": 46579084763
    },
    {
        "method": "pants.engine.internals.graph.resolve_dependencies",
        "count": 4585,
        "duration_micros": 46556427119
    },
    {
        "method": "pants.backend.python.dependency_inference.rules.infer_python_dependencies_via_imports",
        "count": 3216,
        "duration_micros": 44731238539
    },
    {
        "method": "pants.backend.python.dependency_inference.parse_python_imports.parse_python_imports",
        "count": 3216,
        "duration_micros": 44514117183
    },
    {
        "method": "pants.engine.process.fallible_to_exec_result_or_raise",
        "count": 3216,
        "duration_micros": 28124661088
    },
    {
        "method": "process",
        "count": 3224,
        "duration_micros": 23976103909
    }
]

The dropoff between parse_python_imports and fallible_to_exec_result_or_raise is the time spent actually capturing the sources: the time spent in process is primarily the time spent accessing and then hitting the cache(s).

Surprisingly, the breakdown of time within process execution seems to be largely bound on local IO. The direct children of process look like:

[
    {
        "method": "remote_cache_read_speculation",
        "count": 3590,
        "duration_micros": 9230203432
    },
    {
        "method": "local_cache_read",
        "count": 3626,
        "duration_micros": 7594634338
    },
    {
        "method": "local_cache_write",
        "count": 3590,
        "duration_micros": 7303851431
    },
    {
        "method": "remote_cache_write",
        "count": 18,
        "duration_micros": 12707169
    },
    {
        "method": "remote_cache_write_setup",
        "count": 18,
        "duration_micros": 42
    }
]

Although the local cache read/write are not the hottest part of process execution, they are notable because we execute them first and last, synchronously, before doing remote cache lookups (whereas remote cache lookups occur in parallel with executing the process).

So, as a next step, it looks like we should try disabling the local cache entirely for this user, since reading from and writing to it looks like a bottleneck. If that improves things, we might want to make local cache lookups concurrent with remote cache lookups (sketched below), and/or skip local writes when we've already hit remotely.
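
A minimal sketch of the concurrent-lookup idea (illustrative Python only; the engine's scheduler is Rust, and the cache functions, timings, and return values here are hypothetical stand-ins): issue the local and remote cache reads at the same time and take the first hit, instead of waiting on the local read before starting the remote one.

```python
# Minimal sketch of racing local and remote cache lookups; all names and
# timings are hypothetical stand-ins.
import asyncio
from typing import Optional


async def local_cache_read(key: str) -> Optional[dict]:
    await asyncio.sleep(0.002)  # stand-in for local store latency
    return None                 # pretend: local miss


async def remote_cache_read(key: str) -> Optional[dict]:
    await asyncio.sleep(0.050)  # stand-in for network round trip
    return {"exit_code": 0}     # pretend: remote hit


async def lookup(key: str) -> Optional[dict]:
    local = asyncio.create_task(local_cache_read(key))
    remote = asyncio.create_task(remote_cache_read(key))
    result = None
    # Take the first lookup that reports a hit; if both miss, the caller
    # would fall back to actually running the process (omitted here).
    for finished in asyncio.as_completed([local, remote]):
        result = await finished
        if result is not None:
            break
    for task in (local, remote):
        task.cancel()  # drop whichever lookup is still in flight
    return result


print(asyncio.run(lookup("process-fingerprint")))
```

If both lookups miss, the real flow would fall back to running the process, and per the suggestion above could also skip the local write when the remote read has already hit.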


stuhood commented Apr 4, 2022

We've made significant progress on the performance of this case (from 10-12 minutes to 3-4 minutes), so although it is definitely not as low as we'd like it to be, I'm going to call this one resolved.

Once the dust settles on the 2.11.x release, and people are stably running with the new settings (in particular, with PEX-native lockfiles and the fix for pantsbuild/pants#14127 in place), we can revisit.

Please continue to open issues for the 2022 Performance Push. Thanks!

stuhood closed this as completed Apr 4, 2022