Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data-driven approach to v6 distro search #2265

Merged
merged 1 commit into from
Nov 21, 2024
Merged

Conversation

wagoodman
Copy link
Contributor

@wagoodman wagoodman commented Nov 18, 2024

Today we have a few hard coded elements that affect which namespace is selected for given distro information:

The upside of this logic is that it makes matching more accurate, the downside is that this kind of code requires a grype release to update (thus users need to update grype itself to get these updates). This is not ideal since a distro-related version change or vulnerability data change can happen at any time.

Ideally this should be fundamentally data driven and flow in from the grype database, which is updated nightly -- this PR nudges us heavily in that direction by migrating this kind of logic to a DB lookup table, which is populated with grype-db, and paired with more agnostic processing of distro data (for example, we still have version specificity searching, but it is unaware of specific ).

Open questions:

  • debian@sid: today we look for "sid" in the PRETTY_NAME for os-release, which is something that is fixed to the artifact, but as I understand it sid is a moving target... so wouldn't it be invalid to match non-trixie sid vuln data (foxy is the next codename) with trixie artifacts? Doesn't that mean we should exclusively look at the codename and ignore sid status altogether? We're going to need to appropriate pkg source lists into syft / grype / grype-db / vunnel. That is outside of the scope of this PR
  • alpine@edge: it looks like this would be a single version with multiple pre-releases, but the vunnel data is specifically hard coding this to "edge", is this fictitious? (should we just use versions as we find them in the vuln data) this is correct, we should be using the alpha versions to correlate with edge, no matter the major/minor version
  • should this logic be lifted from the DB into memory for faster continual access of OS specific questions? That is, much of this is querying the OS table for each distro search, but that table is relatively small and could be replaced with a single datastructure that is read into memory upon the first query. deferred; I don't want to optimize too much in advance though, so elected to not do this first since it could always be done at a later date since that could be implemented in a non-breaking way.

@wagoodman wagoodman added the changelog-ignore Don't include this issue in the release changelog label Nov 18, 2024
@wagoodman wagoodman added this to the DB v6 milestone Nov 18, 2024
@wagoodman wagoodman requested a review from a team November 18, 2024 22:24
@wagoodman wagoodman self-assigned this Nov 18, 2024
@kzantow
Copy link
Contributor

kzantow commented Nov 19, 2024

To offer opinions:

debian@sid: today we look for "sid" in the PRETTY_NAME for os-release, which is something that is fixed to the artifact, but as I understand it sid is a moving target... so wouldn't it be invalid to match non-trixie sid vuln data (foxy is the next codename) with trixie artifacts? Doesn't that mean we should exclusively look at the codename and ignore sid status altogether?

I was under the impression that there is no vuln data for old unstable versions -- that the unstable feed is only for the current unstable branch, is this correct? Do we know which codename (probably mixing up terms here) is currently in unstable at any given point? If so, I think considering something with an old codename + sid status as a prerelease for that codename would be the right thing to do, if we have enough info to do so.

alpine@edge: it looks like this would be a single version with multiple pre-releases, but the vunnel data is specifically hard coding this to "edge", is this fictitious? (should we just use versions as we find them in the vuln data)

IIRC there was a difficulty mapping the unstable alpine feed without this behavior. Maybe just that the data for the distro isn't published in the normal feed until the release becomes stable? But if the versions align without doing some mapping to "edge", I would agree it sounds better to just use the version name directly.

should this logic be lifted from the DB into memory for faster continual access of OS specific questions? That is, much of this is querying the OS table for each distro search, but that table is relatively small and could be replaced with a single datastructure that is read into memory upon the first query. I don't want to optimize too much in advance though, so elected to not do this first since it could always be done at a later date since that could be implemented in a non-breaking way.

Probably, yes. Agree it doesn't have to be done here -- only optimize things that need it. I worked on a configuration-heavy app that did something like this: always keep configuration loaded in memory and it worked great. If nothing else, it kept a lot of the programming model easier.

@wagoodman
Copy link
Contributor Author

It looks like there are a couple of answers here:

  • debian sid will be dealt with later as we'll need to add another enhancement: searching by source list. This is out of scope for this PR and will be added at a later time
  • I've confirmed that alpine edge processing is simply looking for alpha in the version

@wagoodman wagoodman marked this pull request as ready for review November 19, 2024 20:52
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
return query
// search for aliases for the given distro; we intentionally map some OSs to other OSs in terms of
// vulnerability (e.g. `centos` is an alias for `rhel`). If an alias is found always use that alias in
// searches (there will never be anything in the DB for aliased distros).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what happens here when we do add affected entries for a distro that was previous an alias (like when we bring in data for almalinux)? I think what would happen is that the alias would no longer exist in the grype db that is published that now contains the almalinux data, but just want to validate that assumption.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And all prior clients would adjust correctly to it becoming a comprehensive vulnerability dataset rather than an alias to rhel data

@wagoodman wagoodman merged commit df54421 into main Nov 21, 2024
10 checks passed
@wagoodman wagoodman deleted the v6-data-driven-distro branch November 21, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog-ignore Don't include this issue in the release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants