
Hermetic toolchains incur significant overhead when running py_binary and py_test #1653

Open
martis42 opened this issue Dec 21, 2023 · 10 comments


@martis42
Contributor

🐞 bug report

Affected Rule

py_binary and py_test

Is this a regression?

No

Description

Hermetic toolchains add significant overhead to executing Python targets due to the large number of files added to the runfiles tree.
An analysis of the issue can be found here: https://github.com/martis42/show_hermetic_python_overhead

Bazel caching of course mitigates much of this issue in day-to-day work.
Still, whenever a test has to rerun, we pay this additional overhead.

I don't know if a "perfect" solution is possible given how Bazel sandboxing works.
However, it would be great if rules_python added a note to its documentation for hermetic toolchains recommending that users add --nolegacy_external_runfiles to their .bazelrc file.

Maybe the rules_python maintainers can even convince the Bazel maintainers to prioritize flipping --legacy_external_runfiles in Bazel 8?
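
For reference, a minimal `.bazelrc` entry implementing that recommendation (a sketch; `test` inherits `build` flags, so one line suffices):

```
# .bazelrc — avoid duplicating external-repository files (including the
# hermetic toolchain) under <workspace>/external/ in every runfiles tree.
build --nolegacy_external_runfiles
```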

🔬 Minimal Reproduction

You can reproduce this with this small example workspace: https://github.com/martis42/show_hermetic_python_overhead

🔥 Exception or Error

NA

🌍 Your Environment

Operating System:

Linux Mint 21.1

Output of bazel version:

Bazelisk version: v1.16.0
Build label: 7.0.0
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Mon Dec 11 16:51:49 2023 (1702313509)
Build timestamp: 1702313509
Build timestamp as int: 1702313509

Rules_python version:

0.27.0

Anything else relevant?

#1624 might be related; it may, however, be Windows-specific.

@rickeylev
Contributor

Yeah, the caching can only do so well. The stdlib is a decent size. It's just a lot of files that have to be symlinked.

I wonder if we'd be able to zip up the stdlib. I think that can Just Work.

Otherwise, maybe declare_symlink or symlink(target_path=...) (which are now out of experimental in Bazel 7, I think) could be used to add one file to the runtime's set of files, with that file simply pointing back to the location of the runtime. I haven't had a chance to try those out, though, so I'm not sure about any downsides they have.

I really wish we had directories as a first-class concept, so Bazel knew to materialize a single symlink but monitor its contents recursively.
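
To illustrate the declare_symlink idea, a minimal sketch (the rule and attribute names are hypothetical, not anything in rules_python; it assumes the absolute path to the runtime could somehow be known, which is exactly the open question):

```starlark
# runtime_symlink.bzl — illustrative only; declare_symlink and
# symlink(target_path = ...) are real Bazel APIs, but this rule is made up.
def _runtime_symlink_impl(ctx):
    # One dangling-symlink artifact instead of thousands of stdlib files.
    link = ctx.actions.declare_symlink(ctx.label.name)
    ctx.actions.symlink(
        output = link,
        # Placeholder: a rule can't reliably compute this absolute path,
        # which is the main caveat discussed in the next comment.
        target_path = ctx.attr.runtime_path,
    )
    return [DefaultInfo(files = depset([link]))]

runtime_symlink = rule(
    implementation = _runtime_symlink_impl,
    attrs = {"runtime_path": attr.string(doc = "Absolute path to the runtime.")},
)
```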

@rickeylev
Contributor

A few of us were experimenting with symlinks and declare_directory (TreeArtifacts). The result is not really promising.

For local strategy actions, it works very well: TreeArtifacts create a single symlink to the directory; very cheap! Unfortunately, sandboxed actions don't do that; they copy the whole directory and/or symlink each individual file. We tried to trick Bazel into creating a symlink, but no luck.

The declare_symlink option has a similar problem: it can create an arbitrary symlink, but a rule can't really know the correct absolute path to the directory.

Some maintainers were talking, and zipping the stdlib is probably the best option available. That would greatly reduce the number of files. A few companies already do this. A PR implementing this would be welcome.
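
As a plausibility check for the zip approach (an illustrative sketch, not the proposed rules_python change): CPython's zipimport can load stdlib modules from a zip archive placed on the module search path.

```python
# Zip one stdlib package ('json') as a stand-in for the whole stdlib,
# then import it from the zip in a fresh interpreter.
import os
import subprocess
import sys
import tempfile
import zipfile

stdlib_dir = os.path.dirname(os.__file__)
zip_path = os.path.join(tempfile.mkdtemp(), "stdlib_subset.zip")

with zipfile.ZipFile(zip_path, "w") as zf:
    pkg_dir = os.path.join(stdlib_dir, "json")
    for root, _, files in os.walk(pkg_dir):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, stdlib_dir))

# PYTHONPATH entries precede the stdlib directories on sys.path, so the
# child interpreter resolves 'json' from inside the zip.
result = subprocess.run(
    [sys.executable, "-c", "import json; print(json.__file__)"],
    env=dict(os.environ, PYTHONPATH=zip_path),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # .../stdlib_subset.zip/json/__init__.py
```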

Digging around in the files of the Python install, I also think maybe 25-50% of them aren't necessary for running a program; e.g., the vendored pip dependencies and the interpreter headers don't need to be in the runfiles to run a binary.

On the Bazel side, there is bazelbuild/bazel#8230, which is roughly about the cost of creating large sets of files and is one of the better issues demonstrating that symlinking to a directory would be a big performance benefit; this issue also affects js/node.

github-actions bot

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs in 30 days.
Collaborators can add an assignee to keep this open indefinitely. Thanks for your contributions to rules_python!

@github-actions github-actions bot added the Can Close? Will close in 30 days if there is no new activity label Jul 10, 2024
@martis42
Contributor Author

This is still relevant.

@github-actions github-actions bot removed the Can Close? Will close in 30 days if there is no new activity label Jul 11, 2024
@rickeylev
Contributor

Question: is using a locally installed platform runtime, instead of a remotely downloaded in-build runtime, an appealing option for this problem?

I ask because I've merged some of the basics for using a locally installed runtime (one equivalent to what our hermetic runtimes do) in PR #2000. But we're trying to figure out some of the use cases for it to better understand what public APIs it needs.

@martis42
Contributor Author

martis42 commented Jul 17, 2024

> ... an appealing option to this problem?

No, it is not. Technically it would solve the problem, at least on Linux (I do not work with other platforms). However, besides managing a polyglot workspace, Bazel's appeal and big selling point are, in my view, hermeticity and thus reproducibility of actions, which enable the excellent caching performance. A host interpreter toolchain, to my understanding, defeats hermeticity.

That said, I am still very much interested in rules_python offering better support for host interpreter toolchains. There are some known quirks with the hermetic toolchain, and switching to the host interpreter is an easy workaround for those (e.g. during debugging). While it is easy enough to set the host interpreter itself, what is missing are the toolchain files defined here, e.g. python_headers. Because of this, in the project I work on we have defined our own host toolchains and a build setting to opt into using the host toolchain instead of the hermetic one.

@rickeylev
Contributor

> what is missing are the toolchain definitions, headers ... we have our own host toolchains defined

Oh, that's perfect. That's exactly the sort of case I had in mind. PR #2000 added the basics to do all that, though the APIs are approximately private. I created #2070 to collect user feedback.

I'd be particularly interested in whether you're doing anything to automatically locate, e.g., headers and libraries to wire them into the toolchain. Inspecting the Python install to figure that out was quite a mystery, and I'm really uncertain what edge cases I missed.

@martis42
Contributor Author

We are not dynamically discovering headers and libraries. Currently we only support host toolchains for Linux (Ubuntu) and a subset of Python versions. For this use case it was easier to have a dedicated BUILD file per toolchain and let users select them explicitly.
In my experience, the BUILD files differ only in the version number in the paths to the toolchain files, not structurally. At least for Ubuntu hosts, a generic solution should be trivial.
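
For illustration, such a per-version BUILD file might look roughly like this (a sketch only: all names are illustrative, not rules_python APIs, and the paths assume Ubuntu's python3.10 and python3-dev packages attached via new_local_repository(name = "host_python_3_10", path = "/usr")):

```starlark
# host_python_3_10.BUILD — sketch of a per-version host toolchain file.
load("@rules_python//python:defs.bzl", "py_runtime")

py_runtime(
    name = "runtime",
    interpreter_path = "/usr/bin/python3.10",
    python_version = "PY3",
    visibility = ["//visibility:public"],
)

cc_library(
    name = "python_headers",
    hdrs = glob(["include/python3.10/**/*.h"]),
    includes = ["include/python3.10"],
    visibility = ["//visibility:public"],
)
```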

One caveat is that multiple Debian packages are involved in a full Python toolchain; not everybody might have python3-dev installed. To make sure users don't run into hard-to-understand errors, it might help to search for some key file in a repository rule and provide a verbose error if it is missing.
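
A sketch of that kind of check (hypothetical names; the key file is an assumption based on what Ubuntu's python3-dev package installs):

```starlark
def _host_python_check_impl(rctx):
    header = rctx.path("/usr/include/python3.10/Python.h")
    if not header.exists:
        fail(
            "Host Python toolchain is incomplete: {} not found. ".format(header) +
            "On Ubuntu, install the headers via `apt install python3-dev`.",
        )
    # The check itself is the point; generate an empty package.
    rctx.file("BUILD.bazel", "")

host_python_check = repository_rule(implementation = _host_python_check_impl)
```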

@rickeylev
Contributor

rickeylev commented Jul 17, 2024

(back to OP issue)
I was told that the 7.2 release of Bazel contained some performance improvements to sandboxing, which should help with the overhead of creating files. There's also a flag that might make things faster: --experimental_inmemory_sandbox_stashes
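
For anyone who wants to try it, a minimal `.bazelrc` sketch (the flag is experimental, so availability depends on your Bazel version):

```
# .bazelrc
build --experimental_inmemory_sandbox_stashes
```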

@martis42
Contributor Author

I used my artificial benchmark to look into this issue again.

--experimental_inmemory_sandbox_stashes does not improve performance for me. Maybe the flag only improves performance in specific use cases that do not match my setup. However, comparing Bazel versions shows that Bazel has indeed become more efficient at setting up and executing sandboxed tests. For my test setup I saw the following numbers:

| Bazel version | Elapsed time |
| --- | --- |
| 6.5.0 | 54.3 s |
| 7.0.0 | 48.0 s |
| 7.2.1 | 36.4 s |
| 8.0.0-pre.20240718.2 | 36.9 s |

This is great, as it is a free improvement for everybody who simply uses a recent Bazel version.
Still, compared to using a host interpreter, the overhead is significant. I still believe Bazel should support an optimized mechanism for linking toolchain artifacts into the runfiles tree.

martis42 added a commit to martis42/depend_on_what_you_use that referenced this issue Aug 5, 2024
We now use the lock file for local optimization, but do not check it in
yet. We will try out this feature a bit before doing so.

Bazel 7.2.1 also offers performance improvements, see:
bazelbuild/rules_python#1653 (comment)