Scanning Implementation

This article explains how to implement vulnerability scanning for Chainguard Images and Wolfi.

At a high level, this happens in three steps:

  1. Detect that the container image declares its distro as either "Wolfi" or "Chainguard".
  2. Build a list of distro packages that have been installed.
  3. For each package on that list, determine the known vulnerabilities that affect the package.

Note: This article assumes you're already familiar with the material covered in Foundational Concepts.

Step 1: Detecting the distro

🎯 Goal of this step: Determine if the distro ID is either wolfi or chainguard.

Similar to many Linux distributions, distro-identifying information is found on the filesystem at /etc/os-release.

Within the Chainguard ecosystem, this file will indicate that the distro is either "Wolfi" or "Chainguard".

Here's an example /etc/os-release for Wolfi:

ID=wolfi
NAME="Wolfi"
PRETTY_NAME="Wolfi"
VERSION_ID="20230201"
HOME_URL="https://wolfi.dev"

And here's an example /etc/os-release for Chainguard:

ID=chainguard
NAME="Chainguard"
PRETTY_NAME="Chainguard"
VERSION_ID="20230214"
HOME_URL="https://chainguard.dev/"

Scanners should parse this file to identify the value in the ID field. Within Chainguard's ecosystem, this value is always either wolfi or chainguard. No other fields in this file are relevant to scanners.

If the ID field is set to any value other than wolfi or chainguard, the distro is not part of Chainguard's ecosystem, and this documentation does not apply to it.
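The detection step can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `parse_os_release_id` and `is_chainguard_distro` are hypothetical, and the parser handles only simple `KEY=value` lines with optional quoting.

```python
from typing import Optional

# Distro IDs that belong to Chainguard's ecosystem, per this article.
CHAINGUARD_DISTRO_IDS = {"wolfi", "chainguard"}


def parse_os_release_id(content: str) -> Optional[str]:
    """Return the value of the ID field from /etc/os-release content, or None."""
    for line in content.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that isn't KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Match the key exactly so that VERSION_ID is not mistaken for ID.
        if key.strip() == "ID":
            return value.strip().strip("\"'")
    return None


def is_chainguard_distro(content: str) -> bool:
    """Report whether the os-release content declares wolfi or chainguard."""
    return parse_os_release_id(content) in CHAINGUARD_DISTRO_IDS
```

A scanner would read `/etc/os-release` from the image filesystem and pass its text to `is_chainguard_distro` before applying the rest of this guidance.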

Chainguard's distros are not versioned

The Wolfi and Chainguard distros themselves do not have versions. This is unlike some other Linux distros (like Debian, Ubuntu, RHEL, etc.) that do have versions. Chainguard's distros can each be thought of as "rolling", ever-present collections of packages.

"Well then, what's that VERSION_ID field doing there??"

Good question!

The important takeaways here are:

  1. This value has no bearing on vulnerability scanning, and
  2. This value should not be represented to your scanner's users as the "distro version".

For the curious: The longer answer is that in the Chainguard ecosystem, virtually every file in a container image belongs to exactly one distro package, including the /etc/os-release file!

So, for example, Wolfi has a package called wolfi-baselayout. You can find its build definition here. In that build definition, you might notice a line that has VERSION_ID="${{package.version}}". From that line, we can deduce that in the above Wolfi example, the line VERSION_ID="20230201" is telling us that the value 20230201 is actually the version of the wolfi-baselayout package that happened to be included in the container image. Consequently, this version has no bearing on the installed version of any other package in the image.

Using SBOMs to detect the distro

Some vulnerability scanners have support for scanning an SBOM (software bill of materials) of a software artifact, in addition to (or instead of) scanning the raw software artifact itself (e.g. a container image).

If your scanner has this capability, it's acceptable to rely on the SBOM's description of the distro, as long as you can determine the identity of the underlying container image's distro as being either wolfi or chainguard.

Step 2: Detecting installed packages

🎯 Goal of this step: Build a list of all installed distro packages in the container image.

For each package, we need to capture the following bits of information, at a minimum:

  1. The package's name
  2. The package's version
  3. The package's "origin package"

Given that you've determined that the container image's distro is either wolfi or chainguard, you can find a database of all distro packages that have been installed at /lib/apk/db/installed. The specification for this "installed database" is documented here. The fields we're interested in are documented here.

You'll need to parse the installed database, capturing the data we need for every entry in the database.

Here's an excerpt of an installed database that describes an installed package called "libcrypto3":

P:libcrypto3
V:3.1.2-r0
A:x86_64
L:Apache-2.0
T:OpenSSL libcrypto library
o:openssl

The P: field tells us the package's name.

The V: field tells us the package's version.

The o: field tells us the name of the package's origin package.

At the end of parsing, you should have a list of all the packages installed in the image, along with the data that we'll use in Step 3: Matching installed packages to vulnerabilities.
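Parsing the installed database might look like the sketch below. It assumes that entries are separated by blank lines and that each field occupies one line in the form `X:value`; the function name `parse_installed_db` and the output dictionary keys are hypothetical choices for illustration.

```python
from typing import Dict, List


def parse_installed_db(content: str) -> List[Dict[str, str]]:
    """Extract name, version, and origin for each entry in the installed database."""
    # Field letters are case-sensitive: "o" (origin) is lowercase.
    wanted = {"P": "name", "V": "version", "o": "origin"}
    packages: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for line in content.splitlines():
        if not line.strip():
            # A blank line closes the current package entry.
            if current:
                packages.append(current)
                current = {}
            continue
        field, separator, value = line.partition(":")
        if separator and field in wanted:
            current[wanted[field]] = value
    # The final entry may not be followed by a blank line.
    if current:
        packages.append(current)
    return packages
```

Applied to the excerpt above, this yields one record with the name libcrypto3, the version 3.1.2-r0, and the origin openssl.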

What's an "origin package"?

In the APK ecosystem, package metadata can express not only a package's own name, but also the name of an "origin" package. The APK specification doesn't specify what exactly this means. But for Wolfi packages and Chainguard's private packages in particular, "origin" packages correlate to our build definitions (such as this one for the Wolfi openssl package).

Each build definition is used to build a package, but it can also be used to create "subpackages" associated with that main package.

These "main" packages are the "origin" packages, and their metadata will have an "origin" value equal to their "name" value. Subpackages, on the other hand, will have their own distinct "name" value, and their "origin" value will reference the "main" package for the build definition that was used to create the subpackage.

Subpackages share the same version numbers as their origin package, so package foo version 1.2.3-r4 was built and released at the same time as package foo-subpackage version 1.2.3-r4.
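To make the relationship concrete, the sketch below groups installed packages by their origin and flags subpackages. The helper names `group_by_origin` and `is_subpackage` are hypothetical, and the input records are assumed to carry the name and origin values captured during parsing.

```python
from collections import defaultdict
from typing import Dict, List


def is_subpackage(package: Dict[str, str]) -> bool:
    """A subpackage has an origin that differs from its own name."""
    return package["origin"] != package["name"]


def group_by_origin(packages: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each origin package name to the names of installed packages built from it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for package in packages:
        groups[package["origin"]].append(package["name"])
    return dict(groups)
```

Grouping this way is convenient later, because vulnerability data in the secdb is looked up by origin name rather than by the installed package's own name.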

Using SBOMs to detect installed packages

If the SBOM already lists the installed distro packages for you, and each package item includes the data fields we need, you can rely on the SBOM and avoid parsing the image's installed database.

Step 3: Matching installed packages to vulnerabilities

🎯 Goal of this step: Produce a list of vulnerability matches, where each match links an installed distro package to a known vulnerability, using either the secdb associated with the identified distro or the unified OSV feed.

(If using the secdb) Selecting the correct vulnerability data set

As mentioned in "Foundational Concepts", Chainguard provides two different secdbs. We need to be sure we're using data from the correct secdb.

We make this choice using the distro ID we got from Step 1: Detecting the distro:

  • ID=wolfi: Use the Wolfi secdb, found at https://packages.wolfi.dev/os/security.json.

  • ID=chainguard: Use the Chainguard secdb, found at https://packages.cgr.dev/chainguard/security.json.
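This selection amounts to a small lookup table. In the sketch below, the URLs are the ones listed above, while the function name `secdb_url_for` is a hypothetical helper.

```python
# Mapping from distro ID (Step 1) to the secdb that should be consulted.
SECDB_URLS = {
    "wolfi": "https://packages.wolfi.dev/os/security.json",
    "chainguard": "https://packages.cgr.dev/chainguard/security.json",
}


def secdb_url_for(distro_id: str) -> str:
    """Return the secdb URL for a distro ID, rejecting IDs outside the ecosystem."""
    try:
        return SECDB_URLS[distro_id]
    except KeyError:
        raise ValueError(f"not a Chainguard ecosystem distro: {distro_id!r}") from None
```

Raising on an unknown ID, rather than falling back to a default secdb, keeps data from one distro from being applied to another by accident.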

(Optional) Incorporating secdb data into your own database

Many vulnerability scanners choose to create their own, query-optimized database of vulnerability information, rather than to have their scanners rely directly on upstream data sources like Chainguard's secdbs.

If this is the case for your scanner, just make sure to keep track of which records in your database correspond to which of Chainguard's secdbs. For example, when we say "use the Wolfi secdb", that would mean "SELECT ... WHERE source = wolfi-secdb", translated to however you query your data.
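As one possible shape for such a database, the sketch below uses an in-memory SQLite table with a `source` column. The table name, column names, source labels, and the `EXAMPLE-VULN-1` row are all hypothetical and exist only to illustrate the filtering; your own schema will differ.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE advisories ("
    "source TEXT, origin TEXT, vulnerability TEXT, fixed_version TEXT)"
)
conn.executemany(
    "INSERT INTO advisories VALUES (?, ?, ?, ?)",
    [
        # Illustrative rows only; each row remembers which secdb it came from.
        ("wolfi-secdb", "openssl", "CVE-2023-3446", "3.1.1-r3"),
        ("chainguard-secdb", "openssl", "EXAMPLE-VULN-1", "0"),
    ],
)


def advisories_for(db: sqlite3.Connection, source: str, origin: str) -> List[Tuple[str, str]]:
    """Return (vulnerability, fixed_version) rows for one secdb source and origin package."""
    cursor = db.execute(
        "SELECT vulnerability, fixed_version FROM advisories "
        "WHERE source = ? AND origin = ? ORDER BY vulnerability",
        (source, origin),
    )
    return cursor.fetchall()
```

Filtering on `source` in every query is what keeps Wolfi records from being matched against a Chainguard image, and vice versa.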

Also remember that Chainguard's data feeds are updated frequently. You'll want to ensure your database is incorporating the latest data from Chainguard's data feeds at least daily, if not more frequently, so that your users see correct vulnerability scan results as soon as possible.

Finding vulnerabilities for each package

For each package on your list from Step 2: Detecting installed packages, you should know the package's name, version, and origin package.

For the sake of an example, let's say we have this package on our list:

{
  "name": "libcrypto3",
  "version": "3.1.1-r2",
  "origin": "openssl"
},

Using the secdb to find vulnerability matches

(If you're using the OSV feed instead, jump down to "Using the OSV feed to find vulnerability matches".)

The next step is to identify vulnerabilities for this package, using data from the selected secdb.

📣 Package vulnerabilities are filed in the secdb by origin package name.

So for our example, we'll look for secdb records filed under the package "openssl" in order to find potential vulnerabilities in our "libcrypto3" package.

Let's say this is the data from the selected secdb for the origin package "openssl":

{
  "pkg": {
    "name": "openssl",
    "secfixes": {
      "0": [
        "CVE-2023-0466",
        "CVE-2023-4807"
      ],
      "3.0.8-r0": [
        "CVE-2022-4203",
        "CVE-2022-4304"
      ],
      "3.1.1-r2": [
        "CVE-2023-2975"
      ],
      "3.1.1-r3": [
        "CVE-2023-3446"
      ],
      "3.1.1-r4": [
        "CVE-2023-3817"
      ]
    }
  }
},

Determining which of these vulnerabilities should be matched to our installed "libcrypto3" package is a matter of comparing the installed version of "libcrypto3" (discovered during Step 2) to each of the versions listed here under secfixes.

For any secfixes entry where our installed package version is LESS THAN the listed secfixes entry version, that secfixes entry's vulnerabilities affect the installed package!

So following our example, where our container image has libcrypto3 version 3.1.1-r2 installed, we'd know that our package is affected by CVE-2023-3446 and CVE-2023-3817, because those CVEs are addressed in later versions of our package than what we have installed in the image. And we know that our package is not affected by CVE-2023-0466, CVE-2023-4807, CVE-2022-4203, CVE-2022-4304, or CVE-2023-2975.
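The comparison can be sketched as below. Note the loud assumption: `parse_simple_apk_version` is a hypothetical, deliberately simplified parser that understands only dotted numeric versions with an optional `-rN` revision. Real APK versions allow richer syntax (letters and suffixes such as `_rc`), so a production scanner should use a complete APK version comparison instead.

```python
import re
from typing import Dict, List, Tuple


def parse_simple_apk_version(version: str) -> Tuple[Tuple[int, ...], int]:
    """Parse 'X.Y.Z-rN' into ((X, Y, Z), N). Simplified: numeric components only."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:-r(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    numbers = tuple(int(part) for part in match.group(1).split("."))
    revision = int(match.group(2)) if match.group(2) is not None else 0
    return numbers, revision


def affecting_vulnerabilities(installed_version: str, secfixes: Dict[str, List[str]]) -> List[str]:
    """Return IDs from secfixes entries whose fixed version is greater than the installed one."""
    installed = parse_simple_apk_version(installed_version)
    matches: List[str] = []
    for fixed_version, vulnerability_ids in secfixes.items():
        # Installed version LESS THAN the fixed version means the fix is not present.
        if installed < parse_simple_apk_version(fixed_version):
            matches.extend(vulnerability_ids)
    return sorted(matches)


# The secfixes data for "openssl" from the example above.
OPENSSL_SECFIXES = {
    "0": ["CVE-2023-0466", "CVE-2023-4807"],
    "3.0.8-r0": ["CVE-2022-4203", "CVE-2022-4304"],
    "3.1.1-r2": ["CVE-2023-2975"],
    "3.1.1-r3": ["CVE-2023-3446"],
    "3.1.1-r4": ["CVE-2023-3817"],
}
```

Calling `affecting_vulnerabilities("3.1.1-r2", OPENSSL_SECFIXES)` reproduces the conclusion above: only the entries fixed in 3.1.1-r3 and 3.1.1-r4 are reported.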

Using the OSV feed to find vulnerability matches

To find vulnerabilities in the OSV feed that match a given package, you only need to know the distro ID, the package's name, and its version. (You don't need to find the package's origin package name.)

Packages in the OSV feed are identified by purl (Package URL — see the specification for more information).

OSV JSON documents have an .affected property, which is a list of objects. Each object has a .package property, which itself has a .purl property. You can index this value in order to quickly find vulnerability data given the above information.

For example, if you find that an image's distro ID is wolfi, and there's an APK package named ko installed, you can look for affected ranges in the OSV documents where the package purl is pkg:apk/wolfi/ko.

From there, you can find that affected object's .ranges property in order to compare the APK package's installed version to the versions affected by the given vulnerability.
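A purl-keyed index over OSV documents might be built as in the sketch below. The helper names are hypothetical, and the sketch assumes each range carries an `events` list whose objects use `introduced` and `fixed` keys, as in the OSV schema. Comparing the extracted versions against the installed version is left to your APK version comparison.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Index = Dict[str, List[Tuple[str, List[Dict[str, Any]]]]]


def purl_for(distro_id: str, package_name: str) -> str:
    """Build the purl used to identify an APK package in the feed."""
    return f"pkg:apk/{distro_id}/{package_name}"


def build_purl_index(osv_documents: List[Dict[str, Any]]) -> Index:
    """Index OSV documents by .affected[].package.purl for quick lookup."""
    index: Index = defaultdict(list)
    for document in osv_documents:
        for affected in document.get("affected", []):
            purl = affected.get("package", {}).get("purl")
            if purl:
                index[purl].append((document.get("id", ""), affected.get("ranges", [])))
    return index


def fixed_versions_for(index: Index, distro_id: str, package_name: str) -> List[Tuple[str, List[str]]]:
    """Return (vulnerability ID, fixed versions) pairs recorded for a package."""
    results = []
    for vulnerability_id, ranges in index.get(purl_for(distro_id, package_name), []):
        fixed = [
            event["fixed"]
            for version_range in ranges
            for event in version_range.get("events", [])
            if "fixed" in event
        ]
        results.append((vulnerability_id, fixed))
    return results
```

Building the index once per data refresh avoids rescanning every OSV document for each installed package.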

The meaning of version "0"

You probably noticed that one of the "fixed" versions in the examples above was "0". There was never a version "0" of the libcrypto3 or openssl packages. So what does this mean?

As a standard part of Chainguard's analysis, we identify vulnerabilities that are false positives and have been incorrectly associated with our packages, due to NVD data being incorrect or incomplete, a bug in an existing scanner implementation, etc. When this happens, we record that the vulnerability does not apply to our package in Chainguard's advisory data.

These false positive conclusions get translated to secfixes entries (and OSV data) with version "0", because 0 sorts as less than any possible installed version of an APK package.

(Optional) Supplementing Chainguard's data with additional vulnerability information

It's perfectly acceptable to use only Chainguard's data when performing a scan of a Wolfi or Chainguard distro container image. Chainguard is continuously monitoring numerous vulnerability data sources to identify vulnerabilities that could affect any of its packages, and address any potential vulnerabilities with fixes or other advisory updates (such as identifying false positives).

However, some scanner teams may prefer to add more vulnerability assessment to a container image scan, in one or both of the following ways:

  1. Matching installed packages to NVD data (or an internal dataset derived from NVD data), in addition to Chainguard's data.
  2. Discovering "packages" beyond those listed in the installed database (e.g. via SCA), and performing additional vulnerability matching against those packages.

Let's look at the correct way to implement these extended functionalities, in case they're relevant to your scanner. Again, this section is not part of the core matching implementation required for Chainguard Images and Wolfi.

Matching installed distro packages to NVD data

Finding accurate vulnerability matches using NVD data is notoriously difficult. NVD itself provides very little in the way of machine-readable data that's useful for querying and matching; and what's more, CVE records are not consistently assessed for quality before admission into the NVD.

If you do choose to use NVD data in matching, the implementation should be as follows:

For each package,

  1. Use your NVD dataset and matching implementation to determine the set of potential vulnerability matches to the package.
  2. Use the Chainguard dataset to determine the set of vulnerabilities known not to affect the installed package, and "subtract" that set from the set of NVD-sourced vulnerabilities. (This, for example, is where the version "0" data can be useful.)
  3. Vulnerabilities matched to the installed package using solely Chainguard's data should still be included in the final result set, per our standard implementation guidance.
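For a single package, the three steps above reduce to set arithmetic. The sketch below assumes you have already computed the three input sets; the function name `combine_matches` and its parameter names are hypothetical.

```python
from typing import Iterable, Set


def combine_matches(
    nvd_candidates: Iterable[str],
    chainguard_not_affected: Iterable[str],
    chainguard_matches: Iterable[str],
) -> Set[str]:
    """Subtract Chainguard's known non-matches from NVD candidates, then add Chainguard's matches."""
    # Step 2: remove vulnerabilities Chainguard's data says do not affect the package.
    remaining = set(nvd_candidates) - set(chainguard_not_affected)
    # Step 3: matches found solely through Chainguard's data are always kept.
    return remaining | set(chainguard_matches)
```

An NVD candidate that Chainguard's data does not mention at all survives the subtraction, which matches the guidance that Chainguard's data only removes what it explicitly accounts for.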

Discovering non-distro packages for vulnerability matching

Your scanner may include additional SCA capabilities, such as to find language ecosystem ("non-distro") packages like Ruby Gems, Go modules, NPM packages, etc. In addition, your scanner may support scanning of SBOMs, which may include non-distro packages in addition to distro packages.

If you choose to attempt vulnerability matching for non-distro packages, the first thing to be aware of is that often, non-distro packages are "owned" by distro packages.

Referring back to the APK spec for the installed database, you'll see that for each package, you can additionally parse which filesystem paths are managed by that distro package.

If you're detecting non-distro packages, it's crucial that you keep track of which filesystem paths were used as evidence for the non-distro package. That's because you need to determine if the non-distro package evidence paths are wholly contained in the declared filesystem paths for the package in the installed database. If that is the case, then the distro package "owns" the non-distro package.
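The ownership test is a subset check on paths. The sketch below assumes each distro package record carries a `paths` list already reconstructed from the installed database; the record shape, the helper names, and the path normalization are hypothetical choices.

```python
from typing import Dict, Iterable, List, Optional


def normalize_path(path: str) -> str:
    """Normalize to an absolute path without a trailing slash, for comparison."""
    return "/" + path.strip("/")


def find_owning_package(
    evidence_paths: Iterable[str], distro_packages: List[Dict[str, object]]
) -> Optional[str]:
    """Return the name of the distro package whose paths wholly contain the evidence paths."""
    evidence = {normalize_path(path) for path in evidence_paths}
    if not evidence:
        # Without evidence paths, ownership cannot be established.
        return None
    for package in distro_packages:
        owned = {normalize_path(path) for path in package["paths"]}
        # "Wholly contained": every evidence path must be declared by the package.
        if evidence <= owned:
            return str(package["name"])
    return None
```

A `None` result corresponds to the unowned case described next, such as software added on top of a Chainguard base image.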

In the wild, it's possible to find non-distro packages that are not owned by distro packages. For example, a user may retrieve a container image from Chainguard to use as their base image, and then continue to install additional software that didn't come from Chainguard.

If your scanner is including non-distro packages in the scope of its vulnerability matching, the implementation should be as follows:

  1. During distro package detection, be sure to additionally parse the list of filesystem paths associated with each package.

  2. Perform distro package vulnerability matching per our standard implementation guidance.

  3. Perform non-distro vulnerability matching as you do today, making sure to record the filesystem paths used as evidence for each package.

  4. For each non-distro package vulnerability match, determine if the package is owned by a distro package.

    a. If so, determine if the vulnerability is listed in Chainguard's data for the owning package's origin package, for a fixed version LESS THAN OR EQUAL TO the installed version of the owning distro package. If so, this vulnerability match must be subtracted out of the final scan result; otherwise, the vulnerability should be included in the final scan result.

    b. If not, Chainguard's data does not account for this vulnerability, and it should be included in the final scan result.
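Step 4 can be sketched as a filter over the non-distro matches. The sketch assumes two precomputed inputs, both hypothetical in shape: each match is a dictionary with `vulnerability` and `owner` keys (`owner` is `None` when no distro package owns the evidence paths), and `resolved_by_owner` maps each owning distro package to the vulnerability IDs that Chainguard's data lists with a fixed version less than or equal to that package's installed version.

```python
from typing import Dict, List, Optional, Set

Match = Dict[str, Optional[str]]


def filter_non_distro_matches(
    matches: List[Match], resolved_by_owner: Dict[str, Set[str]]
) -> List[Match]:
    """Drop non-distro matches that Chainguard's data shows as resolved in the owning package."""
    results: List[Match] = []
    for match in matches:
        owner = match.get("owner")
        if owner is not None and match["vulnerability"] in resolved_by_owner.get(owner, set()):
            # Step 4a: the owning package already accounts for this vulnerability.
            continue
        # Step 4a (not resolved) or step 4b (no owner): keep the match.
        results.append(match)
    return results
```

Keeping the version comparison outside this function means the same comparison logic used for distro packages can populate `resolved_by_owner`.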


Next up: Verifying Scan Results