Skip to content
This repository has been archived by the owner on Oct 2, 2023. It is now read-only.

Add md5sums file list to distroless container #2065

Merged

Conversation

pombredanne
Copy link
Contributor

This ensures that a "distroless" container layer tarball built from
Debian packages contains not only the control file of each package, but
also the md5sums file that lists original files included in a package.
The md5sums file is extracted from a .deb package and saved side-by-side
with the control under this path:
var/lib/dpkg/status.d/.md5sums

Reference: #1876
Signed-off-by: Philippe Ombredanne pombredanne@nexb.com

PR Checklist

Please check if your PR fulfills the following requirements:

  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Other... Please describe:

What is the current behavior?

Issue Number: #1876

The current behaviour is to skip including the md5sum listing the list of package files when installing a Debian package in a distroless container image.

What is the new behavior?

The new behaviour is to included the md5sum listing the list of package files when installing a Debian package in a distroless container image, side-by-side with the control file.

Does this PR introduce a breaking change?

  • Yes
  • No

Other information

I used only conservative Python syntax and kept the same code style as before.
There was no specific documentation entry, so I added a docstring

@pombredanne
Copy link
Contributor Author

@loosebazooka gentle ping. You asked for a patch here in GoogleContainerTools/distroless#741 (comment)

Oh sorry, yeah I mean I don't know why this form of metadata was chosen. Anyway, it seems like the correct place to inject the metadata is in rules_docker. Please provide a patch there.

Do you mind to have a look before it goes stale?
This is fairly straight forward and is tested.

@pombredanne
Copy link
Contributor Author

@smukherj1 @alexeagle @pcj @gravypod gentle ping too. Some kind of review (and approval) would be very nice!

if not control_file_member:
raise self.DebError(deb + ' does not Metadata File!')
raise self.DebError(deb + ' does not contain a control Metadata File!')
control_file = tar.extractfile(control_file_member[0])
metadata = b''.join(control_file.readlines())
Copy link
Contributor

@aiuto aiuto May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this could be more readably expressed as

metadata = control_file.read().decode("utf-8")
pkg_name = TarFile.parse_pkg_name(metadata, deb)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aiuto good point. Done in the latest push. I also rebased. Thank you for the review!

Copy link
Contributor

@aiuto aiuto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but I don't maintain this repo.
Again, repeating the friendly ping.

@pombredanne
Copy link
Contributor Author

pombredanne commented May 25, 2022

Joshua Katz asked on Slack:

Might I ask: what is the use case here? What do these files do?

Here is the skinny:

  • The background is explained in this original issue Keep lists of packaged-installed files inside a built image GoogleContainerTools/distroless#741
  • A Debian package comes with an actual payload of code, data, docs and with metadata. The metadata is essentially a control file with name/value pairs like the name and version of a package, its description, etc. And a list of the files installed in the payload tarball. This is an md5sums file with (path, md5) for each installed file.
  • This listing of file paths help introspects an installed Debian-based OS/VM/Container to find which files were exactly installed from this package archive. Short of this you cannot guess this information.
  • When installed with apt/dpkg/aptitude these files as part of the regular Debian package install and copied to the actual installed rootfs. Tools that analyze Debian installations such as VMs or Docker images (like https://github.com/nexB/scancode.io/ that I co-maintain) rely on this to find which files come from which package.
    -The bazel docker rules ignore this md5sums file when building distroless containers that are built from Debian packages. It is never copied to the actual container (other metadata files are copied though)
    -Therefore it is impossible or hard to introspect and observer which files come from which original Debian package in a Distroless container built with these rules, because the md5sums list of these files is missing.

This introspection is important for security and license compliance.
Without this, Distroless containers make this process brittle and complicated to automate and creates extra busywork for no good reason.

@pombredanne pombredanne force-pushed the 1876-distroless-docker-md5sums branch from 89acc82 to 01c4b4d Compare May 25, 2022 09:51
@loosebazooka
Copy link
Contributor

Are scanners going to be aware of the distroless specific custom directory structure: /var/lib/dpkg/status.d? Why not just write this information into the dpkg standard .../info (and keep whatever is currently in ../status.d for continuity)?

@pombredanne
Copy link
Contributor Author

@loosebazooka re:

Are scanners going to be aware of the distroless specific custom directory structure: /var/lib/dpkg/status.d? Why not just write this information into the dpkg standard .../info (and keep whatever is currently in ../status.d for continuity)?

All the open source scanners that can handle distroless images have long been aware of the /var/lib/dpkg/status.d "oddity" and have long been able to handle this. 😇 😁

This is true for scancode.io and scancode-toolkit but also all the other open source tools capable of scanning Distroless images that I know of including tern and anchore.

In scancode-toolkit (and scancode.io), this is handled there https://github.com/nexB/scancode-toolkit/blob/aba31126dcb3ab57f2b885090f7145f69b67351a/src/packagedcode/debian.py#L345 (this has recently been deeply refactored but this was there long before).

All these tools open source (or not) will benefit directly from this small patch.

@pombredanne
Copy link
Contributor Author

pombredanne commented May 25, 2022

@loosebazooka

Why not just write this information into the dpkg standard .../info (and keep whatever is currently in ../status.d for continuity)?

I am all for it.... I guess I have no idea why a /status.d dir was used in the first place, except that it does allow independent installation without having to worry of merging control files in one?

Since this is already in odd territory that's no longer exactly Debian, I just kept on piling on this. Tell me if you want to use the "standard" dpkg /info directory instead. Frankly I grew to like the simple status.d/ dir.

@loosebazooka
Copy link
Contributor

Sure, I don't have a strong preference either way, I was just curious. I was hoping to reduce the burden on new scanners, but I barely know what that would even mean.

LGTM from the distroless side.

@gravypod
Copy link
Collaborator

This was reviewed by the SWEs on Distroless and it looks good. I'm going to approve this PR but we can't merge until the tests go green. Some of these tests are flakey so they might just need to be rerun.

Copy link
Collaborator

@gravypod gravypod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets get the tests pass then we can submit this. Thanks for putting this together!

@pombredanne
Copy link
Contributor Author

I triggered a build by removing odd trailing whitespaces ;)

@pombredanne
Copy link
Contributor Author

@gravypod the CI failure is about a failure to fetch https://bazel.build/bazel-release.pub.gpg
The checksum of this key is not the expected one.
I have no control over this.

(14:36:32) WARNING: Download from https://bazel.build/bazel-release.pub.gpg failed: class com.google.devtools.build.lib.bazel.repository.downloader.UnrecoverableHttpException Checksum was 8375bd5de1778a9fbb58a482a7ce9444ab9b1f6bb5fddd3700ae86b3fe0e4d3a but wanted 547ec71b61f94b07909969649d52ee069db9b0c55763d3add366ca7a30fb3f6d
(14:36:32) ERROR: An error occurred during the fetch of repository 'bazel_gpg':
   Traceback (most recent call last):
	File "/var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/bazel_tools/tools/build_defs/repo/http.bzl", line 150, column 33, in _http_file_impl
		download_info = ctx.download(
Error in download: java.io.IOException: Error downloading [https://bazel.build/bazel-release.pub.gpg] to /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/ec321eb2cc2d0f8f91b676b6d4c66c29/external/bazel_gpg/file/downloaded: Checksum was 8375bd5de1778a9fbb58a482a7ce9444ab9b1f6bb5fddd3700ae86b3fe0e4d3a but wanted 547ec71b61f94b07909969649d52ee069db9b0c55763d3add366ca7a30fb3f6d

This ensures that a "distroless" container layer tarball built from
Debian packages contains not only the control file of each package, but
also the md5sums file that lists original files included in a package.
If present, we extract the md5sums file and save is side-by-side with
the package control file under this path:
  var/lib/dpkg/status.d/<package-name>.md5sums

Reference: bazelbuild#1876
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne force-pushed the 1876-distroless-docker-md5sums branch from e25680e to c63f166 Compare May 26, 2022 14:49
@pombredanne
Copy link
Contributor Author

pombredanne commented May 26, 2022

@gravypod the issue (or the fix) is that the published gpg checksum has just been updated

sha256 = "8375bd5de1778a9fbb58a482a7ce9444ab9b1f6bb5fddd3700ae86b3fe0e4d3a",
by @sluongng and seems to be because of a deeper global issue of key expiration bazelbuild/bazel#15558

I rebased and forced pushed and this should clean this up hopefully.

@pombredanne
Copy link
Contributor Author

@gravypod we are all green now 🍏 🎉
@loosebazooka would you have an idea of how long this will take to propagate to actual public distroless images (as used in k8s) once merged?

@gravypod gravypod merged commit 558ae6f into bazelbuild:master May 30, 2022
@pombredanne pombredanne deleted the 1876-distroless-docker-md5sums branch May 30, 2022 15:47
St0rmingBr4in pushed a commit to St0rmingBr4in/rules_docker that referenced this pull request Oct 17, 2022
* Add md5sums file list to distroless container

This ensures that a "distroless" container layer tarball built from
Debian packages contains not only the control file of each package, but
also the md5sums file that lists original files included in a package.
If present, we extract the md5sums file and save is side-by-side with
the package control file under this path:
  var/lib/dpkg/status.d/<package-name>.md5sums

Reference: bazelbuild#1876
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Remove trailing whitespaces

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants