Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: report unused inputs for the tar rule #951

Merged
merged 13 commits into from
Oct 13, 2024

Conversation

plobsing
Copy link
Contributor

@plobsing plobsing commented Sep 27, 2024

The mtree spec passed to the tar rule very often selects a subset of the inputs made available through the srcs attribute. In many cases, these subsets do not break down cleanly along dependency-tree lines and there is no simple way just pass less content to the tar rule.

One prominent example where this occurs is when constructing the tars for OCI image layers. For instance when building a Python-based container image, we might want to split the Python interpreter, third-party dependencies, and application code into their own layers. This is done by filtering the mtree_spec.

However, in the operation to construct a tar from a subsetted mtree, it is usually still an unsubsetted tree of srcs that gets passed. As a result, the subset tarball is considered dependent upon a larger set of sources than is strictly necessary.

This over-scoping runs counter to a very common objective associated with breaking up an image into layers - isolating churn to a smaller slice of the application. Because of the spurious relationships established in Bazel's dependency graph, all tars get rebuilt anytime any content in the application gets changed. Tar rebuilds can even be triggered by changes to files that are completely filtered-out from all layers of the container.

Redundant creation of archive content is usually not too computationally intensive, but the archives can be quite large in some cases, and avoiding a rebuild might free up gigabytes of disk and/or network bandwidth for
better use. In addition, eliminating the spurious dependency edges removes erroneous constraints applied to the build action schedule; these tend to push all Tar-building operations towards the end of a build, even when some archive construction could be scheduled much earlier.

Risk assessment and mitigation

The unused_inputs_list mechanism used to report spurious dependency relationships is a bit difficult to use. Reporting an actually-used input as unused can create difficult to diagnose problems down the line.

However, the behaviour of the mtree-based tar rule is sufficiently simple and self-contained that I am fairly confident that this rule's used/unused set can be determined accurately in a maintainable fashion.

Out of an abundance of caution I have gated this feature behind a default-off flag. The tar rule will continue to operate as it had before - typically over-reporting dependencies - unless the --@aspect_bazel_lib//lib:tar_compute_unused_inputs flag is passed.

Filter accuracy

The vis encoding used by the mtree format to resiliently handle path names has a small amount of "play" to it - it is reversible but the encoded representation of a string is not
unique. Two unequal encoded strings might decode to the same value; this can happen when at least one of the encoded strings contains unnecessary escapes that are nevertheless honoured by the decoder.

The unused-inputs set is determined using a filter that compares vis-encoded strings. In the presence of non-canonically-encoded paths, false-mismatches can lead to falsely reporting that an input is unused.

The only vis-encoded path content that is under the control of callers is the mtree content itself; all other vis-encoded strings are constructed internally to this package, not exposed publicly, and are all derived using the lib/private/tar.bzl%_vis_encode function; all of these paths are expected to compare exactly. Additionally, it is expected that many/most users will use this package's helpers (e.g. mtree_spec) when crafting their mtree content; such content is also safe. It is only when the user crafts their own mtree, or modifies an mtree spec's content= fields' encoding in some way, that a risk of inaccurate reporting arises. The chances for this are expected to be minor since this seems like an inconvenient and not-particularly-useful thing for a user to go out of their way to do.

Copy link
Collaborator

@alexeagle alexeagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems generally great to me. @thesayyn want to TAL?

lib/BUILD.bazel Show resolved Hide resolved
lib/private/tar.bzl Show resolved Hide resolved
Copy link
Collaborator

@thesayyn thesayyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is brilliant

lib/private/tar.bzl Outdated Show resolved Hide resolved
The `mtree` spec passed to the `tar` rule very often selects a subset of the
inputs made available through the `srcs` attribute. In many cases, these
subsets do not break down cleanly along dependency-tree lines and there
is no simple way just pass less content to the `tar` rule.

One prominent example where this occurs is when constructing the tars
for OCI image layers. For instance when [building a Python-based
container image](https://github.com/bazel-contrib/rules_oci/blob/main/docs/python.md),
we might want to split the Python interpreter, third-party dependencies, and
application code into their own layers. This is done by [filtering the
`mtree_spec`](https://github.com/aspect-build/bazel-examples/blob/85cb2aaf8c6e51d5e9e086cc94b94ab896903fb0/oci_python_image/py_layer.bzl#L39).

However, in the operation to construct a `tar` from a subsetted mtree,
it is usually still an unsubsetted tree of `srcs` that gets passed. As
a result, the subset tarball is considered dependent upon a larger set
of sources than is strictly necessary.

This over-scoping runs counter to a very common objective associated with
breaking up an image into layers - isolating churn to a smaller slice of
the application. Because of the spurious relationships established in
Bazel's dependency graph, all tars get rebuilt anytime any content in
the application gets changed. Tar rebuilds can even be triggered by
changes to files that are completely filtered-out from all layers of the container.

Redundent creation of archive content is usually not too computationally
intensive, but the archives can be quite large in some cases, and
avoiding a rebuild might free up gigabytes of disk and/or network
bandwidth for
better use. In addition, eliminating the spurious dependency edges
removes erroneous constraints applied to the build action schedule;
these tend to push all Tar-building operations towards the end of a
build, even when some archive construction could be scheduled much earlier.

## Risk assessment and mitigation

The `unused_inputs_list` mechanism used to report spurious dependency
relationships is a bit difficult to use. Reporting an actually-used
input as unused can create difficult to diagnose problems down the line.

However, the behaviour of the `mtree`-based `tar` rule is sufficiently
simple and self-contained that I am fairly confident that this rule's
used/unused set can be determined accurately in a maintainable fashion.

Out of an abundance of caution I have gated this feature behind a
default-off flag. The `tar` rule will continue to operate as it had
before - typically over-reporting dependencies - unless the
`--@aspect_bazel_lib//lib:tar_compute_unused_inputs` flag is passed.

### Filter accuracy

The `vis` encoding used by the `mtree` format to resiliently handle path
names has a small amount of "play" to it - it is reversable but the
encoded representation of a string is not
unique. Two unequal encoded strings might decode to the same value; this
can happen when at least one of the encoded strings contains unnecessary
escapes that are nevertheless honoured by the decoder.

The unused-inputs set is determined using a filter that compares
`vis`-encoded strings. In the presence of non-canonically-encoded
paths, false-mismatches can lead to falsely reporting that an input is
unused.

The only `vis`-encoded path content that is under the control of callers
is the `mtree` content itself; all other `vis`-encoded strings are
constructed internally to this package, not exposed publicly, and are
all derived using the `lib/private/tar.bzl%_vis_encode` function; all of
these paths are expected to compare exactly. Additionally, it is expected that
many/most users will use this package's helpers (e.g. `mtree_spec`) when
crafting their mtree content; such content is also safe. It is only when
the user crafts their own mtree, or modifies an mtree spec's `content=`
fields' encoding in some way, that a risk of inaccurate reporting
arises. The chances for this are expected to be minor since this seems
like an inconvenient and not-particularly-useful thing for a user to go
out of their way to do.
This control surface provides for granular control of the feature. The
interface is selected to mirror the common behaviour of `stamp` attributes.
This is accepted by bsdtar/libarchive. In fact `contents=` is the only of
the pair documented in `mtree(5)`; `content=` is an undocumented
alternate form supported by libarchive.
Bazel's interpretation of unused_inputs_list cannot accomodate certain
things in filenames. These are also likely to mess up our own
line-oriented protocol in the shellscript that produces this file.
@plobsing
Copy link
Contributor Author

plobsing commented Oct 6, 2024

FYI, after having a bit more time to think about this, I think minor tweaks could make this a lot more exact and a lot less risky. I've iterated on areas that were inexact that were internal to the implementation.

What remains is exactness WRT vis encoding, which is a bigger issue (affects _mtree_line too). We can't do full Unicode correctly (because of #794 ) but we could probably cover the full 7-bit ASCII range. Combined with a check for non-ASCII that fails the build when computing the unused inputs set (you'd have to disable input pruning if you work with exotically-named files), I think that could make this feature safe enough to become a default (in v3.0). WDYT? I'll leave that to a later change though, this change is large enough as it stands.

lib/private/tar.bzl Outdated Show resolved Hide resolved
Copy link
Collaborator

@thesayyn thesayyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent!

plobsing and others added 2 commits October 8, 2024 15:05
Whoops.

Co-authored-by: Sahin Yort <thesayyn@gmail.com>
@alexeagle alexeagle merged commit bca34bd into bazel-contrib:main Oct 13, 2024
22 checks passed
@plobsing plobsing deleted the tar_compute_unused_inputs branch October 14, 2024 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants