Improve product vendor matching for component list scanning #1504
Comments
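I like the heuristic approach. I'd like to suggest that we make sure to log the mappings in a way that users can tell that we "guessed" and give them some way to fine-tune the guesses and save the data so they don't have to re-run the heuristic on subsequent runs.
I think we talked about this with the known package lists, but it turned out the mappings were obvious often enough that it wasn't absolutely needed to get the scripts working. That said, I still think it's quite reasonable for cve-bin-tool to maintain some lookup tables to improve mappings as we continue to improve our detection capabilities. I think a lookup table to supplement improved heuristics has a few advantages:
1. Make it possible to map multiple {vendor, product} pairs to a known product string (e.g. I think kerberos has multiple match options).
2. Possibly speed up runs where the heuristic would be running frequently against similar components (e.g. libc6 on full container scans)
3. Help us build data about what mappings look like so we can fine tune the heuristic.
4. Help us build data about mappings as a community service.
5. Allow us to do fancy "if you got this string from pypi it means one thing but in a .jar file it means something else" matching if we wanted.
We'd want to make it easy for users to contribute knowledge back to us -- maybe prompt them to open a github issue with the data, with the carrot that updating the mappings would make the warnings go away in future versions? (Or even a direct pull request, but I think issues are probably easier.)
I'm not sure about the best format here. For the running of the tool, we'd probably want to use a sqlite db the way we do with nvd, but for pull requests and analysis I think we might want something more text-based and diff-able to improve pull requests and make it easier for people to view the data directly. JSON maybe? And then have the tool consume it into sqlite and update if a new json is provided? Something else?
We should probably spend a bit of time figuring out how the mappings are likely to work and come up with a reasonable data structure.
At a guess, we'll have at least two types of mapping:
1. [common guess string] -> [NVD {vendor, product} pair] -- These would be mostly 1:1 but potentially also n:n where you could have multiple strings map to multiple NVD pairs. (e.g. kerberos and krb5 -> {mit, kerberos} and {mit, kerberos_5} as we currently see in the checker.)
2. [common guess string] + [metadata] -> [NVD {vendor, product} pair] -- where the metadata would probably be something about where we found the string, so 'cryptography' in a python requirements.txt wouldn't have to map to the same thing as cryptography when found in a java .jar file.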
|
Terri
Looks like I have started something that could be a step change in
improving the detection capability of the tool.
I agree it needs some more thinking, and some design/architecture work
would probably be worthwhile before we launch into implementation. No idea
if this would be a suitable GSOC project or not, because I can't work out
how hard it will become at this stage.
I was also thinking about the case where we find multiple vendors for a
product. We currently just put a * to say we guessed the vendor; if there
are multiple vendors available, maybe we should flag this differently
(this should be relatively easy to do). Alternatively we could just
provide all product/vendor mappings and find all of the potential
vulnerabilities.
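As a rough illustration of the flagging idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `SAMPLE_PAIRS` data, the `match_status` helper and the three status labels are hypothetical and are not taken from cve-bin-tool.

```python
# Illustrative (vendor, product) pairs; a real run would read these from the
# CVE database. "exampleprod" and its two vendors are made up for the example.
SAMPLE_PAIRS = [
    ("mit", "kerberos"),
    ("gnu", "glibc"),
    ("vendor_a", "exampleprod"),
    ("vendor_b", "exampleprod"),
]


def match_status(product, pairs=SAMPLE_PAIRS):
    """Return (status, matches) for a product name.

    status is "unmatched" (no vendor found), "guessed" (exactly one vendor,
    the case currently marked with a *) or "ambiguous" (several vendors).
    All matching pairs are returned so none is silently dropped.
    """
    wanted = product.lower()
    matches = sorted((v, p) for v, p in pairs if p == wanted)
    if not matches:
        status = "unmatched"
    elif len(matches) == 1:
        status = "guessed"
    else:
        status = "ambiguous"
    return status, matches
```

Returning every pair together with a status would let the report show a different marker for the ambiguous case while still checking each candidate vendor.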
I think we can get some automation from the checkers to pre-populate a look
up table although it might also be good to allow a user to add new mappings
(essentially a form of crowd-sourcing :-)).
Regards
Anthony
On Thu, 6 Jan 2022, 20:43 Terri Oda, ***@***.***> wrote:
I like the heuristic approach. I'd like to suggest that we make sure to
log the mappings in a way that users can tell that we "guessed" and give
them some way to fine-tune the guesses and save the data so they don't have
to re-run the heuristic on subsequent runs.
I think we talked about this with the known package lists, but it turned
out the mappings were obvious often enough that it wasn't absolutely needed
to get the scripts working. That said, I still think it's quite reasonable
for cve-bin-tool to maintain some lookup tables to improve mappings as we
continue to improve our detection capabilities. I think a lookup table to
supplement improved heuristics has a few advantages:
1. Make it possible to map multiple {vendor, product} pairs
to a known product string (e.g. I think kerberos has multiple match
options).
2. Possibly speed up runs where the heuristic would be running
frequently against similar components (e.g. libc6 on full container scans)
3. Help us build data about what mappings look like so we can fine
tune the heuristic.
4. Help us build data about mappings as a community service.
5. Allow us to do fancy "if you got this string from pypi it means one
thing but in a .jar file it means something else" matching if we wanted.
We'd want to make it easy for users to contribute knowledge back to us --
maybe prompt them to open a github issue with the data, with the carrot
that updating the mappings would make the warnings go away in future
versions? (Or even a direct pull request, but I think issues are probably
easier.)
I'm not sure about the best format here. For the running of the tool, we'd
probably want to use a sqlite db the way we do with nvd, but for pull
requests and analysis I think we might want something more text-based and
diff-able to improve pull requests and make it easier for people to view
the data directly. JSON maybe? And then have the tool consume it into
sqlite and update if a new json is provided? Something else?
We should probably spend a bit of time figuring out how the mappings are
likely to work and come up with a reasonable data structure.
At a guess, we'll have at least two types of mapping:
1. [common guess string] -> [NVD {vendor, product} pair] -- These
would be mostly 1:1 but potentially also n:n where you could have multiple
strings map to multiple NVD pairs. (e.g. kerberos and krb5 -> {mit,
kerberos} and {mit, kerberos_5} as we currently see in the checker.)
2. [common guess string] + [metadata] -> [NVD {vendor, product} pair]
-- where the metadata would probably be something about where we found the
string, so 'cryptography' in a python requirements.txt wouldn't have
to map to the same thing as cryptography when found in a java .jar file.
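One possible shape for such a lookup table, sketched in Python under stated assumptions: the JSON field names (`guess`, `source`, `vendor`, `product`), the `mapping` table and both helper functions are hypothetical, and the two `cryptography` vendors are placeholders rather than real NVD entries. It keeps a diff-able JSON file as the source of truth and loads it into sqlite for lookups, covering both mapping types described above.

```python
import json
import sqlite3

# Hypothetical, diff-able mapping file. "source": null means the mapping
# applies wherever the string was found (mapping type 1); a non-null source
# is the metadata for mapping type 2.
MAPPINGS_JSON = """
[
  {"guess": "kerberos", "source": null, "vendor": "mit", "product": "kerberos"},
  {"guess": "kerberos", "source": null, "vendor": "mit", "product": "kerberos_5"},
  {"guess": "krb5", "source": null, "vendor": "mit", "product": "kerberos"},
  {"guess": "krb5", "source": null, "vendor": "mit", "product": "kerberos_5"},
  {"guess": "cryptography", "source": "pypi",
   "vendor": "example_py_vendor", "product": "cryptography"},
  {"guess": "cryptography", "source": "jar",
   "vendor": "example_java_vendor", "product": "cryptography"}
]
"""


def load_mappings(conn, text):
    """Create the mapping table if needed and load the JSON entries into it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mapping ("
        "guess TEXT NOT NULL, source TEXT NOT NULL DEFAULT '', "
        "vendor TEXT NOT NULL, product TEXT NOT NULL, "
        "PRIMARY KEY (guess, source, vendor, product))"
    )
    rows = [
        (m["guess"].lower(), m.get("source") or "", m["vendor"], m["product"])
        for m in json.loads(text)
    ]
    # INSERT OR REPLACE lets a newer JSON file be loaded over an older one.
    conn.executemany("INSERT OR REPLACE INTO mapping VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def lookup(conn, guess, source=None):
    """Return all (vendor, product) pairs for a guess string.

    Source-specific rows win; otherwise fall back to source-independent rows.
    """
    guess = guess.lower()
    query = (
        "SELECT vendor, product FROM mapping "
        "WHERE guess = ? AND source = ? ORDER BY vendor, product"
    )
    if source:
        rows = conn.execute(query, (guess, source)).fetchall()
        if rows:
            return rows
    return conn.execute(query, (guess, "")).fetchall()
```

With this layout, `lookup(conn, "krb5")` returns both MIT pairs, while `lookup(conn, "cryptography", "jar")` and `lookup(conn, "cryptography", "pypi")` can return different pairs for the same string.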
|
Some progress on this, resulting in improved product/vendor matching (I
have only tried this with SBOMs for the time being to try out some ideas;
more thought is needed on how it gets incorporated so that the whole tool
benefits):
1. If multiple vendors are matched, I have modified the code so that all
product/vendor mappings are added to the list of parsed products. I found
out that the simple test of choosing the first one in the list was missing
a valid product/vendor mapping. The downside of this is that there is
potentially some increased false or duplicated reporting but I think that
is manageable.
2. I have used the filename patterns used by the checkers as a way of
mapping to the correct product names (e.g. a search for libc6, which is not
in the NVD, maps to glibc, which is). This improves the hit rate of
product/vendor mappings; the only downside is that the original product
name is not reported.
3. I have noticed that a number of product names are typically reported as
A-B (e.g. commons-io) when the product name in the NVD is A_B (i.e.
commons_io). I have changed the search to look for both names; this
improves the hit rate, particularly for Java-based products.
4. This seems to primarily apply for Java products, but product names are
often reported with a parent package followed by a component name e.g.
jetty-<component>. Modifying the search to remove the component name and
just search for the parent package increases the hit rate.
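The name variations in points 1, 3 and 4 can be sketched roughly as follows. This is not the code from the actual change: `candidate_names`, `find_pairs` and the `SAMPLE_PAIRS` data are hypothetical names for illustration, lowercasing is an extra assumption, and a real run would query the CVE database rather than a list.

```python
# Illustrative (vendor, product) pairs; not read from the real database.
SAMPLE_PAIRS = [
    ("apache", "commons_io"),
    ("eclipse", "jetty"),
    ("mortbay", "jetty"),
]


def candidate_names(name):
    """Return search candidates for a reported name, most specific first."""
    base = name.strip().lower()
    candidates = [base]
    # Point 3: try both the A-B and the A_B spelling.
    for variant in (base.replace("-", "_"), base.replace("_", "-")):
        if variant not in candidates:
            candidates.append(variant)
    # Point 4: fall back to the parent package, e.g. jetty-http -> jetty.
    parent = base.replace("_", "-").split("-", 1)[0]
    if parent and parent not in candidates:
        candidates.append(parent)
    return candidates


def find_pairs(name, pairs=SAMPLE_PAIRS):
    """Return every (vendor, product) pair for the first candidate that hits.

    Point 1: all vendors are kept rather than only the first one found.
    """
    for candidate in candidate_names(name):
        matches = sorted((v, p) for v, p in pairs if p == candidate)
        if matches:
            return matches
    return []
```

Stopping at the first candidate that matches is a design choice in this sketch to limit false positives; the parent-package fallback in particular is broad (commons-io would fall back to plain commons if nothing more specific matched).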
These changes result in many more candidate products having an associated
vendor, and so more potential vulnerabilities being reported.
None of these changes involve any changes to the database structure.
However, the mapping of filenames to products performed by the checkers is
limited by the availability of a checker. I think we may need to think of a
way of specifying additional filename to product mappings (as we discover
them), possibly by another configuration file to allow for user
enhancement/control and independence from the availability of a checker.
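A minimal sketch of how such a configuration file might be layered over the checker-derived mappings, assuming a flat JSON object of name to product. `CHECKER_ALIASES`, `load_aliases` and `resolve` are hypothetical names, and the alias table is a hand-written stand-in for whatever would be extracted from the checkers' filename patterns.

```python
import json

# Stand-in for aliases derived from checker filename patterns (assumption).
CHECKER_ALIASES = {"libc": "glibc", "libc6": "glibc"}


def load_aliases(user_json=None):
    """Merge checker-derived aliases with optional user-supplied ones.

    user_json is the text of a hypothetical config file such as
    {"krb5": "kerberos_5"}; user entries override checker-derived ones.
    """
    aliases = dict(CHECKER_ALIASES)
    if user_json:
        for name, product in json.loads(user_json).items():
            aliases[name.lower()] = product
    return aliases


def resolve(name, aliases):
    """Return (product_to_search, original_name).

    Keeping the original name alongside the mapped product means it can
    still be shown in the report.
    """
    key = name.lower()
    return aliases.get(key, key), name
```

For example, `resolve("libc6", load_aliases())` gives `("glibc", "libc6")`, so the search uses glibc while the report can still mention libc6.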
|
Mapping the equivalence of artifact identifiers across different naming schemes (cpe, swid, various purl namespaces, etc) will definitely be one of the challenges for reliably matching vulnerabilities to software - especially since the data quality at all sources will inevitably vary, so we'll also need to be able to compensate for badly-tagged data. I agree we need to crowd-source this data, and ideally share this effort among projects. Other sources like https://repology.org and https://www.aboutcode.org etc might also be valuable input for this. As a first step, though, I think 'normalizing' the product name by converting to lowercase and dropping any characters like
Which seems correct, but wasn't matched to https://nvd.nist.gov/vuln/detail/CVE-2022-42889 because the (for one thing) CPE, |
I think we're taking this in the direction of using PURL (i.e. what's planned in #3771 ) as our next phase of improving matching. So I'm going to close this issue, but we may want to revisit it later. |
Each of the checkers identifies a product/vendor pair to be used if a particular component is detected in a binary file. This allows, for instance, an item detected as either libc or libc6 to be mapped to the glibc product.
However, if a component list is used (e.g. from an SBOM or a Linux distro), the product name searched for will be libc or libc6. As neither is found in the database, no vulnerabilities will be reported.
One option would be to try several approaches to find a potential match, although there is a risk of an increase in false positives. Approaches to try could include the use of a wildcard (e.g. search for product like "%libc%" in a query), searching for both A-B and A_B, always searching for lowercase names, etc.
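To make the trade-off concrete, here is a small sqlite sketch of an exact search next to a wildcard search. The `products` table, its columns and the sample rows are assumptions for the example and do not reflect the tool's actual database schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (vendor TEXT, product TEXT)")
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("gnu", "glibc"), ("haxx", "libcurl"), ("apache", "commons_io")],
    )
    return conn


def exact_match(conn, name):
    """Search for the lowercase name in both its A-B and A_B spellings."""
    lowered = name.lower()
    names = sorted({lowered, lowered.replace("-", "_")})
    placeholders = ",".join("?" for _ in names)
    return conn.execute(
        "SELECT vendor, product FROM products "
        f"WHERE product IN ({placeholders}) ORDER BY vendor, product",
        names,
    ).fetchall()


def wildcard_match(conn, name):
    """Search with product LIKE '%name%'; broad, so expect false positives."""
    # Escape LIKE metacharacters so '_' and '%' in the name match literally.
    escaped = (
        name.lower().replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    return conn.execute(
        "SELECT vendor, product FROM products "
        "WHERE product LIKE ? ESCAPE '\\' ORDER BY vendor, product",
        (f"%{escaped}%",),
    ).fetchall()
```

In this demo data an exact search for libc6 finds nothing, and the wildcard search for libc finds glibc but also libcurl, which is the kind of false positive mentioned above.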