Add product synonyms #2819

Open
anthonyharrison opened this issue Mar 14, 2023 · 11 comments

Comments

@anthonyharrison
Contributor

#2685 is related

Some products have multiple names, and it would be good to handle this in an elegant way, particularly for the language and SBOM parsers, although there may be some benefits for the checkers as well. My idea would be to have a list of synonyms that names can be checked against.

I am thinking particularly of Java packages, which can be included as org.xxxx in a pom.xml file but can also be referred to as xxxx.jar when scanning a directory or archive.
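
A minimal sketch of the synonym-list idea, assuming a hand-maintained table that maps every known spelling to one canonical product name. `SYNONYMS`, `canonical_name`, `org.example.widget` and `widget` are hypothetical placeholders for illustration, not names from cve-bin-tool or any real package:

```python
from __future__ import annotations

# Hypothetical alias table: every known spelling maps to one canonical name.
SYNONYMS: dict[str, str] = {
    "org.example.widget": "widget",  # as it might appear in a pom.xml
    "widget": "widget",              # as derived from a widget.jar file name
}


def canonical_name(raw: str) -> str | None:
    """Return the canonical product name for a raw identifier, or None."""
    name = raw.strip().lower()
    # A scanned directory or archive yields a file name, so drop the suffix.
    if name.endswith(".jar"):
        name = name[: -len(".jar")]
    return SYNONYMS.get(name)
```

This sketch does not strip version suffixes from file names, so a real lookup would probably need an extra normalisation step before consulting the table.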

Thoughts?

@terriko
Contributor

terriko commented Mar 14, 2023

The checkers already do have synonyms in that they support multiple {vendor, product} pairs. I'm wondering if maybe we should have a similar format for information for basically everything we detect.

I know @anthonyharrison knows what this looks like, but for the benefit of writing it out, here's what this could look like with my favourite multi-name example, beautifulsoup (it's my favourite because all the names are on the website for me to cut and paste easily):

  • CPE / {vendor, product} pairs for nvd
  • packaging names / { package_source, name }
    • {pypi, beautifulsoup4}
    • {pypi, bs4} (yes, it's a real package to limit typosquatting: https://pypi.org/project/bs4/)
    • {debian, python3-bs4}
    • {ubuntu, python3-bs4}
    • {fedora, python-beautifulsoup4}

That would let us potentially map all the package pairs to nvd lookup pairs as an N:N set, which I think is something we need.
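
One way the N:N mapping could be sketched, assuming the data is grouped so that each group lists some package pairs and the nvd lookup pairs they share. `build_index` and `nvd_pairs_for` are hypothetical helper names, and the `crummy_not_in_db` vendor is only a placeholder:

```python
from __future__ import annotations

from collections import defaultdict

# {package_source, name} pairs taken from the list above.
PACKAGE_PAIRS = [
    ("pypi", "beautifulsoup4"),
    ("pypi", "bs4"),
    ("debian", "python3-bs4"),
    ("ubuntu", "python3-bs4"),
    ("fedora", "python-beautifulsoup4"),
]

# Placeholder {vendor, product} pair for the nvd lookup side.
NVD_PAIRS = [("crummy_not_in_db", "beautifulsoup4")]


def build_index(groups):
    """Map each package pair to the set of nvd pairs from every group it is in."""
    index = defaultdict(set)
    for package_pairs, nvd_pairs in groups:
        for pkg in package_pairs:
            index[pkg].update(nvd_pairs)
    return index


def nvd_pairs_for(index, source, name):
    """Return the nvd pairs for one package pair, or an empty list."""
    return sorted(index.get((source, name), ()))


index = build_index([(PACKAGE_PAIRS, NVD_PAIRS)])
print(nvd_pairs_for(index, "pypi", "bs4"))
```

Because the index stores a set per package pair, a package name that appears in several groups simply accumulates all of their nvd pairs.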

The next question would be... how do we store and use this? We could potentially extend the existing checker format:

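from cve_bin_tool.checkers import Checker


class BeautifulSoupChecker(Checker):
    # No binary search patterns: this checker only carries naming metadata.
    CONTAINS_PATTERNS: list[str] = []
    FILENAME_PATTERNS: list[str] = []
    VERSION_PATTERNS: list[str] = []
    # {vendor, product} pair used for nvd lookups.
    VENDOR_PRODUCT = [("crummy_not_in_db", "beautifulsoup4")]
    # Proposed new fields: package names per packaging source.
    PYPI_PACKAGE = ["beautifulsoup4", "bs4"]
    DEBIAN_PACKAGE = ["python-bs4", "python3-bs4"]
    FEDORA_PACKAGE = ["python-beautifulsoup4"]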

Some notes here:

  • since bs4 doesn't actually have any CVEs I'm using the same _not_in_db indicator we use for our own requirements scan; we could maybe do something fancier or codify that better.
  • there's no VERSION_PATTERN, which would mean this couldn't be used in the binary scanner. That might be a thing we want, or maybe we'd want to add a pattern -- I'm not sure what the right way to go is, or whether both should be options. But we'd definitely need to consider how to handle non-binary checkers/checkers without binary search patterns and do it consistently.
  • I opted for a variable for each packaging type -- we might prefer to group the _PACKAGE ones into a single data structure that could be iterated through more easily. Not sure.
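
The grouping mentioned in the last note could look roughly like this. It is only a sketch: `BeautifulSoupMetadata` is a stand-in that does not subclass the real `Checker`, and `PACKAGE_NAMES`, `matches_package` and `all_package_pairs` are hypothetical names:

```python
from __future__ import annotations


class BeautifulSoupMetadata:
    """Stand-in for a checker class; it does not subclass the real Checker."""

    VENDOR_PRODUCT = [("crummy_not_in_db", "beautifulsoup4")]
    # One dict replaces PYPI_PACKAGE, DEBIAN_PACKAGE and FEDORA_PACKAGE.
    PACKAGE_NAMES: dict[str, list[str]] = {
        "pypi": ["beautifulsoup4", "bs4"],
        "debian": ["python-bs4", "python3-bs4"],
        "fedora": ["python-beautifulsoup4"],
    }


def matches_package(checker, source: str, name: str) -> bool:
    """True if the checker lists this name under this packaging source."""
    return name in checker.PACKAGE_NAMES.get(source, [])


def all_package_pairs(checker) -> list[tuple[str, str]]:
    """Flatten the dict into (package_source, name) pairs."""
    return [
        (source, name)
        for source, names in checker.PACKAGE_NAMES.items()
        for name in names
    ]
```

With a single dict, adding a new packaging source means adding a key rather than a new class attribute, and the lookup code does not need to change.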

Open questions:

  • Should this actually be in python code, or are we at the point where this should be json input or something else? (We went with python for the checkers to facilitate the regexes, but that might not apply to non-binary checkers.)
  • Should the binary checker data and the packaging data be in separate file formats?
  • Do we want to make it possible to have explicit non-matches? e.g. be able to say that {java_jar, json-parser} is not the same as {fedora, json-parser} and doesn't use the same nvd lookup pairs?
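
To make the JSON and explicit non-match questions concrete, here is a sketch of one possible record layout. The field names (`vendor_product`, `packages`, `not_packages`), the `example_vendor` vendor string and the `load_record` and `classify` helpers are all assumptions made for illustration:

```python
import json

# Hypothetical record for the json-parser example; the vendor is made up.
EXAMPLE_JSON = """
{
  "vendor_product": [["example_vendor", "json-parser"]],
  "packages": [["java_jar", "json-parser"]],
  "not_packages": [["fedora", "json-parser"]]
}
"""


def load_record(text):
    """Parse one record and turn the JSON lists into hashable tuples."""
    raw = json.loads(text)
    return {
        "vendor_product": [tuple(p) for p in raw["vendor_product"]],
        "packages": {tuple(p) for p in raw["packages"]},
        "not_packages": {tuple(p) for p in raw.get("not_packages", [])},
    }


def classify(record, source, name):
    """Return 'match', 'non-match' or 'unknown' for a package pair."""
    pair = (source, name)
    if pair in record["not_packages"]:
        return "non-match"
    if pair in record["packages"]:
        return "match"
    return "unknown"


record = load_record(EXAMPLE_JSON)
print(classify(record, "fedora", "json-parser"))
```

Keeping "non-match" separate from "unknown" would let a fuzzy name matcher skip pairs that someone has already ruled out by hand.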

@terriko
Contributor

terriko commented Mar 14, 2023

Design goals:

  • We want to make it easy for people to do pull requests and add/fix data. (so, probably use a text-based format that can be edited by hand and handled similar to code pull requests)
  • We may want to be able to list binary checkers and each package type's "checkers" (metadata) separately
  • We will want to integrate this data into the other metadata discussed in GSoC 2023 Project idea: Improved product representation & meta-info about products. #2633 (assuming we get someone to work on that project via summer of code)

@ffontaine
Contributor

release-monitoring.org could be helpful to retrieve the different package names or "synonyms" used by distributions.
Here is the web page for beautifulsoup4: https://release-monitoring.org/project/3779
release-monitoring.org is actively used by Fedora, alpine, Arch-Linux, buildroot: https://release-monitoring.org/distros
release-monitoring could also be used to retrieve and display the latest version for a given package.

@anthonyharrison
Contributor Author

@ffontaine Thanks for the reference to release-monitoring.org. I note it has an API which would make integration with cve-bin-tool relatively easy although it wouldn't work in offline mode unless we could mirror a local copy of the database.
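
A rough sketch of how an offline mode might work, assuming names are cached on disk as they are looked up rather than mirrored in bulk. Nothing here reflects the actual release-monitoring.org API: `fetch` is a caller-supplied function and `lookup_names` is a hypothetical helper:

```python
import json
from pathlib import Path


def lookup_names(project, cache_path, fetch=None, offline=False):
    """Return cached distro names for a project, fetching only when allowed.

    `fetch` is any callable that takes a project name and returns a
    JSON-serialisable mapping of distro to package name.
    """
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if project in cache:
        return cache[project]
    if offline or fetch is None:
        return None  # nothing cached and no network access permitted
    names = fetch(project)
    cache[project] = names
    cache_path.write_text(json.dumps(cache, indent=2))
    return names
```

A cache like this only covers projects that were looked up while online, so it is weaker than a full local copy of the database.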

@ffontaine
Contributor

Indeed, we're already using the API in buildroot to generate this web page: http://autobuild.buildroot.org/stats/master.html (the source code is here: https://git.buildroot.net/buildroot/tree/support/scripts/pkg-stats). I don't know if we can retrieve a local copy.

@metabiswadeep
Contributor

metabiswadeep commented Mar 18, 2023

@terriko So in that project the metadata that needs to be added can be added in the checker files of their respective products using extra parameters like LICENSE_INFO=[""] defined in it?

@terriko
Contributor

terriko commented Mar 20, 2023

> @terriko So in that project the metadata that needs to be added can be added in the checker files of their respective products using extra parameters like LICENSE_INFO=[""] defined in it?

Maybe. There's an open question of whether this should actually go in the checker file itself or whether it should be a separate thing, and a proposal could go either way. I'll put some more thoughts directly in the gsoc issue.

@terriko
Contributor

terriko commented Mar 27, 2023

Slightly pedantic note: it appears that there is a CPE for beautifulsoup:

  "cpe": "cpe:2.3:a:leonard_richardson:beautifulsoup4:4.12.0:*:*:*:*:*:*:*",

Although my assertion about it not having one above may have been incorrect, the fact that we'll likely recognize some number of products that don't have them stands.
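
For reference, a CPE 2.3 formatted string like the one above can be split into a {vendor, product} pair with a few lines of Python. This is a simplified sketch: `CpeName` and `parse_cpe23` are hypothetical names, and the parser assumes no escaped colons inside the fields:

```python
from __future__ import annotations

from typing import NamedTuple


class CpeName(NamedTuple):
    part: str
    vendor: str
    product: str
    version: str


def parse_cpe23(cpe: str) -> CpeName:
    """Extract part, vendor, product and version from a CPE 2.3 string."""
    fields = cpe.split(":")
    # "cpe", "2.3" and eleven attribute fields give thirteen in total.
    if len(fields) != 13 or fields[0] != "cpe" or fields[1] != "2.3":
        raise ValueError(f"not a CPE 2.3 formatted string: {cpe!r}")
    return CpeName(*fields[2:6])


cpe = parse_cpe23("cpe:2.3:a:leonard_richardson:beautifulsoup4:4.12.0:*:*:*:*:*:*:*")
print(cpe.vendor, cpe.product)
```

Parsing the string says nothing about whether the CPE is actually registered, so a generated identifier would still need to be checked against nvd data.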

@ffontaine
Contributor

Where did you get this CPE? I didn't find it on cvedetails.com or nvd.nist.gov

@terriko
Contributor

terriko commented Mar 28, 2023

Hm, maybe it's just what's auto-generated by the sbom tool and not a real CPE id? It's still likely more correct than my previous entry, but maybe we do need to annotate these better.

@anthonyharrison
Contributor Author

@terriko @ffontaine sbom4python autogenerates the CPE and PURL references based on the project metadata. This may not be correct, but I do state this in the documentation: 'Whilst PURL and CPE references are automatically generated for each Python module, the accuracy of such references cannot be guaranteed as they are dependent on the validity of the data associated with the Python module.'


4 participants