Improve product vendor matching for component list scanning #1504

Closed
anthonyharrison opened this issue Jan 5, 2022 · 5 comments

Comments

@anthonyharrison
Contributor

Each of the checkers identifies a product/vendor pair to be used if a particular component is detected in a binary file. This allows, for instance, an item detected as either libc or libc6 to be mapped to the glibc product.

However, if a component list is used (e.g. from an SBOM or a Linux distro package list), the product name searched for will be libc or libc6. Because those names are not found in the database, no vulnerabilities will be reported.

One option would be to try several approaches to determine whether there is a potential match, although this risks increasing the number of false positives. Approaches to try could include the use of a wildcard (e.g. search for product like "%libc%" in a query), searching for both A-B and A_B, always searching for lowercase names, etc.
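The approaches listed above could be sketched roughly as follows. This is only an illustration: the `products` table, its columns, and the helper names are assumptions made for the example, not the actual cve-bin-tool database schema or API. Exact variants (lowercase, A-B, A_B) are tried first, and the wildcard search is used only as a fallback because it is the most likely to produce false positives.

```python
import sqlite3


def candidate_names(name):
    """Return exact-match variants of a component name: lowercase, A-B and A_B."""
    base = name.lower()
    variants = [base, base.replace("-", "_"), base.replace("_", "-")]
    # Preserve order while dropping duplicates.
    return list(dict.fromkeys(variants))


def like_escape(text):
    """Escape LIKE metacharacters so '_' and '%' in a name match literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def find_products(conn, name):
    """Try exact variants first, then fall back to a wildcard search."""
    variants = candidate_names(name)
    placeholders = ",".join("?" * len(variants))
    rows = conn.execute(
        "SELECT DISTINCT vendor, product FROM products "
        f"WHERE product IN ({placeholders}) ORDER BY vendor, product",
        variants,
    ).fetchall()
    if rows:
        return rows, "exact"
    rows = conn.execute(
        "SELECT DISTINCT vendor, product FROM products "
        "WHERE product LIKE ? ESCAPE '\\' ORDER BY vendor, product",
        (f"%{like_escape(variants[0])}%",),
    ).fetchall()
    return (rows, "wildcard") if rows else ([], "none")


def demo_db():
    """Build a tiny in-memory database (hypothetical schema, sample rows only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (vendor TEXT, product TEXT)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("gnu", "glibc"), ("apache", "commons_text"), ("mit", "kerberos")],
    )
    return conn
```

With this sample data, `find_products(demo_db(), "libc")` finds glibc only through the wildcard fallback, while `"libc6"` still finds nothing, so a wildcard alone would not cover every case mentioned above.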

@terriko
Contributor

terriko commented Jan 6, 2022

I like the heuristic approach. I'd like to suggest that we make sure to log the mappings in a way that users can tell that we "guessed" and give them some way to fine-tune the guesses and save the data so they don't have to re-run the heuristic on subsequent runs.
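As a rough sketch of what "log the guess and save it" might look like: the function below consults a saved JSON file before running a heuristic, logs a warning whenever it had to guess, and writes the guess back so later runs can skip the heuristic. The file name, the JSON layout, and the `heuristic` callable are all hypothetical and chosen only for illustration.

```python
import json
import logging
import tempfile
from pathlib import Path

LOGGER = logging.getLogger("mapping_guesses")


def resolve_with_cache(name, cache_path, heuristic):
    """Return a product for `name`, preferring saved mappings over new guesses."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if name in cache:
        # A saved (possibly user-corrected) mapping wins; no heuristic run needed.
        return cache[name]["product"]
    product = heuristic(name)
    if product is not None:
        LOGGER.warning(
            "Guessed product %r for component %r; edit %s to correct it",
            product, name, cache_path,
        )
        cache[name] = {"product": product, "guessed": True}
        cache_path.write_text(json.dumps(cache, indent=2, sort_keys=True))
    return product


def demo():
    """Resolve the same name twice and count how often the heuristic ran."""
    calls = []

    def toy_heuristic(name):
        calls.append(name)
        return {"libc6": "glibc"}.get(name)

    with tempfile.TemporaryDirectory() as tmp:
        cache_path = Path(tmp) / "guessed_mappings.json"
        first = resolve_with_cache("libc6", cache_path, toy_heuristic)
        second = resolve_with_cache("libc6", cache_path, toy_heuristic)
    return first, second, len(calls)
```

Because the saved file is plain JSON, a user could change a wrong guess by hand, and the `"guessed": true` flag leaves room for distinguishing confirmed entries from unreviewed ones.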

I think we talked about this with the known package lists, but it turned out the mappings were obvious often enough that it wasn't absolutely needed to get the scripts working. That said, I still think it's quite reasonable for cve-bin-tool to maintain some lookup tables to improve mappings as we continue to improve our detection capabilities. I think a lookup table to supplement improved heuristics has a few advantages:

  1. Make it possible to map multiple {vendor, product} pairs to a known product string (e.g. I think kerberos has multiple match options).
  2. Possibly speed up runs where the heuristic would be running frequently against similar components (e.g. libc6 on full container scans)
  3. Help us build data about what mappings look like so we can fine tune the heuristic.
  4. Help us build data about mappings as a community service.
  5. Allow us to do fancy "if you got this string from pypi it means one thing but in a .jar file it means something else" matching if we wanted.

We'd want to make it easy for users to contribute knowledge back to us -- maybe prompt them to open a github issue with the data, with the carrot that updating the mappings would make the warnings go away in future versions? (Or even a direct pull request, but I think issues are probably easier.)

I'm not sure about the best format here. For the running of the tool, we'd probably want to use a sqlite db the way we do with nvd, but for pull requests and analysis I think we might want something more text-based and diff-able to improve pull requests and make it easier for people to view the data directly. JSON maybe? And then have the tool consume it into sqlite and update if a new json is provided? Something else?
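One way the "diff-able JSON consumed into sqlite" idea might work is sketched below. The JSON layout, table names, and function names are assumptions for illustration only. The loader stores a hash of the JSON text and re-imports only when the content has changed.

```python
import hashlib
import json
import sqlite3

# Illustrative mapping data; the layout is an assumption, not an agreed format.
MAPPINGS_JSON = """
{
  "libc": [{"vendor": "gnu", "product": "glibc"}],
  "libc6": [{"vendor": "gnu", "product": "glibc"}],
  "krb5": [
    {"vendor": "mit", "product": "kerberos"},
    {"vendor": "mit", "product": "kerberos_5"}
  ]
}
"""


def load_mappings(conn, json_text):
    """Import mappings into sqlite; return False if this JSON was already loaded."""
    digest = hashlib.sha256(json_text.encode("utf-8")).hexdigest()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mapping_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mappings ("
        "guess TEXT, vendor TEXT, product TEXT, "
        "PRIMARY KEY (guess, vendor, product))"
    )
    row = conn.execute(
        "SELECT value FROM mapping_meta WHERE key = 'sha256'"
    ).fetchone()
    if row and row[0] == digest:
        return False
    data = json.loads(json_text)
    rows = [
        (guess.lower(), pair["vendor"], pair["product"])
        for guess, pairs in data.items()
        for pair in pairs
    ]
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM mappings")
        conn.executemany("INSERT INTO mappings VALUES (?, ?, ?)", rows)
        conn.execute(
            "INSERT OR REPLACE INTO mapping_meta VALUES ('sha256', ?)", (digest,)
        )
    return True


def lookup(conn, guess):
    """Return every (vendor, product) pair recorded for a guess string."""
    return conn.execute(
        "SELECT vendor, product FROM mappings WHERE guess = ? "
        "ORDER BY vendor, product",
        (guess.lower(),),
    ).fetchall()
```

Keeping one key per guess string in the JSON means a contributed mapping shows up as a small, readable diff in a pull request.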

We should probably spend a bit of time figuring out how the mappings are likely to work and come up with a reasonable data structure.

At a guess, we'll have at least two types of mapping:

  1. [common guess string] -> [NVD {vendor, product} pair] -- These would be mostly 1:1 but potentially also n:n where you could have multiple strings map to multiple NVD pairs. (e.g. kerberos and krb5 -> {mit, kerberos} and {mit, kerberos_5} as we currently see in the checker.)
  2. [common guess string] + [metadata] -> [NVD {vendor, product} pair] -- where the metadata would probably be something about where we found the string, so 'cryptography' in a python requirements.txt wouldn't have to map to the same thing as 'cryptography' when found in a java .jar file.
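The two mapping types could share one record shape, with the metadata as an optional field. The following is a minimal sketch under that assumption; the class, the `source` labels, and the vendor names used for 'cryptography' are placeholders invented for the example, not real NVD entries. The kerberos pairs are the ones mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mapping:
    guess: str
    vendor: str
    product: str
    source: Optional[str] = None  # None means "applies wherever the string is found"


MAPPINGS = [
    # Type 1: n:n mapping without metadata.
    Mapping("kerberos", "mit", "kerberos"),
    Mapping("kerberos", "mit", "kerberos_5"),
    Mapping("krb5", "mit", "kerberos"),
    Mapping("krb5", "mit", "kerberos_5"),
    # Type 2: same string, different meaning depending on where it was found.
    # Vendor names below are placeholders, not real NVD vendors.
    Mapping("cryptography", "example_python_vendor", "cryptography", source="pypi"),
    Mapping("cryptography", "example_java_vendor", "cryptography", source="maven"),
]


def resolve(guess, source=None):
    """Prefer mappings specific to `source`; otherwise use the generic ones."""
    guess = guess.lower()
    hits = [m for m in MAPPINGS if m.guess == guess]
    specific = [m for m in hits if m.source is not None and m.source == source]
    chosen = specific or [m for m in hits if m.source is None]
    return [(m.vendor, m.product) for m in chosen]
```

In this sketch a string that only has source-specific mappings resolves to nothing when the source is unknown, which avoids guessing between the Python and Java meanings.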

@anthonyharrison
Contributor Author

anthonyharrison commented Jan 6, 2022 via email

@anthonyharrison
Contributor Author

anthonyharrison commented Jan 12, 2022 via email

@raboof
Contributor

raboof commented Oct 27, 2022

Mapping the equivalence of artifact identifiers across different naming schemes (cpe, swid, various purl namespaces, etc) will definitely be one of the challenges for reliably matching vulnerabilities to software - especially since the data quality at all sources will inevitably vary, so we'll also need to be able to compensate for badly-tagged data.

I agree we need to crowd-source this data, and ideally share this effort among projects. Other sources like https://repology.org and https://www.aboutcode.org etc might also be valuable input for this.

As a first step, though, I think 'normalizing' the product name by converting to lowercase and dropping any characters like _ and - would already be a nice improvement for cve-bin-tool. My use case here was that I noticed commons-text was detected by syft (in cyclonedx-json mode) as:

    {
      "bom-ref": "pkg:maven/org.apache.commons/commons-text@1.8?package-id=50aab321a9f4b2fa",
      "type": "library",
      "group": "org.apache.commons",
      "name": "commons-text",
      "version": "1.8",
      "cpe": "cpe:2.3:a:apache-software-foundation:commons-text:1.8:*:*:*:*:*:*:*",
      "purl": "pkg:maven/org.apache.commons/commons-text@1.8",
      ...
      "properties": [
         ...
      ]
    }

Which seems correct, but wasn't matched to https://nvd.nist.gov/vuln/detail/CVE-2022-42889 because (for one thing) the CPE, cpe:2.3:a:apache:commons_text:*:*:*:*:*:*:*:*, uses an underscore.
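A minimal sketch of the normalization I have in mind, using the two CPE strings above. The helper names are made up for the example, and splitting on ":" is a simplification that ignores escaped colons allowed by CPE 2.3.

```python
import re


def normalize(name):
    """Lowercase a product name and drop '-' and '_' characters."""
    return re.sub(r"[-_]", "", name.lower())


def cpe_product(cpe):
    """Return the product field of a CPE 2.3 string (naive split on ':')."""
    # Fields: cpe, 2.3, part, vendor, product, version, ...
    return cpe.split(":")[4]


SYFT_CPE = "cpe:2.3:a:apache-software-foundation:commons-text:1.8:*:*:*:*:*:*:*"
NVD_CPE = "cpe:2.3:a:apache:commons_text:*:*:*:*:*:*:*:*"

same_product = normalize(cpe_product(SYFT_CPE)) == normalize(cpe_product(NVD_CPE))
```

Here `same_product` is true because both names normalize to "commonstext". The vendor fields (apache-software-foundation vs apache) still differ, so this only addresses the product half of the match.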

@terriko
Contributor

terriko commented Apr 17, 2024

I think we're taking this in the direction of using PURL (i.e. what's planned in #3771) as our next phase of improving matching. So I'm going to close this issue, but we may want to revisit it later.

@terriko terriko closed this as completed Apr 17, 2024