-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(environment): support collecting license evidence from installed distributions #674
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,8 +16,10 @@ | |
# Copyright (c) OWASP Foundation. All Rights Reserved. | ||
|
||
|
||
import re | ||
from argparse import OPTIONAL, ArgumentParser | ||
from importlib.metadata import distributions | ||
from base64 import b64encode | ||
from importlib.metadata import Distribution, distributions | ||
from json import loads | ||
from os import getcwd, name as os_name | ||
from os.path import exists, isdir, join | ||
|
@@ -26,8 +28,9 @@ | |
from textwrap import dedent | ||
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple | ||
|
||
from cyclonedx.model import Property | ||
from cyclonedx.model.component import Component, ComponentType | ||
from cyclonedx.model import AttachedText, Encoding, Property | ||
from cyclonedx.model.component import Component, ComponentEvidence, ComponentType | ||
from cyclonedx.model.license import DisjunctiveLicense | ||
from packageurl import PackageURL | ||
from packaging.requirements import Requirement | ||
|
||
|
@@ -47,6 +50,8 @@ | |
|
||
T_AllComponents = Dict[str, Tuple['Component', Iterable[Requirement]]] | ||
|
||
LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*') | ||
|
||
|
||
class EnvironmentBB(BomBuilder): | ||
|
||
|
@@ -106,6 +111,11 @@ def make_argument_parser(**kwargs: Any) -> 'ArgumentParser': | |
# `--local` If in a virtualenv that has global access, do not list globally-installed packages. | ||
# `--user` Only output packages installed in user-site. | ||
# `--path <path>` Restrict to the specified installation path for listing packages | ||
p.add_argument('--collect-evidence', | ||
help='Whether to collect license evidence from components', | ||
action='store_true', | ||
dest='collect_evidence', | ||
default=False) | ||
p.add_argument('python', | ||
metavar='<python>', | ||
help='Python interpreter', | ||
|
@@ -115,8 +125,10 @@ def make_argument_parser(**kwargs: Any) -> 'ArgumentParser': | |
|
||
def __init__(self, *, | ||
logger: 'Logger', | ||
collect_evidence: bool, | ||
**__: Any) -> None: | ||
self._logger = logger | ||
self._collect_evidence = collect_evidence | ||
|
||
def __call__(self, *, # type:ignore[override] | ||
python: Optional[str], | ||
|
@@ -144,6 +156,24 @@ def __call__(self, *, # type:ignore[override] | |
self.__add_components(bom, rc, path=path) | ||
return bom | ||
|
||
@staticmethod | ||
def __collect_license_evidence(dist: Distribution) -> Optional[ComponentEvidence]: | ||
if not (dist_files := dist.files): | ||
return None | ||
|
||
if not (license_files := [f for f in dist_files if LICENSE_FILE_REGEX.match(f.name)]): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No need to guess the name of the file. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jkowalleck maybe the scope of this PR differs slightly from the issue description, but one of my main goals with this was to also collect licenses that the maintainers do not declare. For example, with dual licenses or in bundled and vendored code or where we can't trust maintainers to provide all the fields. So when we build SBOMs internally at my employer I'd then get complaints that SBOMs are incomplete/inaccurate. This tries to address parts of that as well. Take this example of the popular mkdocs-material theme (https://github.com/squidfunk/mkdocs-material): import re
from importlib.metadata import distribution
dist = distribution("mkdocs-material")
LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
dist.metadata.get_all('License-File')
> ['LICENSE']
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('material/templates/.icons/fontawesome/LICENSE.txt'), PackagePath('material/templates/.icons/material/LICENSE'), PackagePath('material/templates/.icons/octicons/LICENSE'), PackagePath('material/templates/.icons/simple/LICENSE.md'), PackagePath('mkdocs_material-9.4.7.dist-info/licenses/LICENSE')] Or just tried another one, https://pypi.org/project/packaging/ with a dual license: dist = distribution("packaging")
dist.metadata.get_all('License-File')
>
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('packaging-23.2.dist-info/LICENSE'), PackagePath('packaging-23.2.dist-info/LICENSE.APACHE'), PackagePath('packaging-23.2.dist-info/LICENSE.BSD')] Maybe it's slightly different scopes but I think it's worth adding all those licenses, WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i see. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, thanks - so if I understand correctly we should try make a set out of the collected and maintainer-declared files right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you might consider preventing duplicates in certain means, yes. |
||
return None | ||
|
||
licenses = [] | ||
for license_file in license_files: | ||
license_name = f'License detected in {license_file}' | ||
encoded_content = b64encode(license_file.read_binary()).decode('ascii') | ||
license_text = AttachedText(content=encoded_content, encoding=Encoding.BASE_64) | ||
license = DisjunctiveLicense(name=license_name, text=license_text) | ||
licenses.append(license) | ||
|
||
return ComponentEvidence(licenses=licenses) | ||
|
||
def __add_components(self, bom: 'Bom', | ||
rc: Optional[Tuple['Component', Iterable['Requirement']]], | ||
**kwargs: Any) -> None: | ||
|
@@ -165,6 +195,10 @@ def __add_components(self, bom: 'Bom', | |
# path of dist-package on disc? naaa... a package may have multiple files/folders on disc | ||
) | ||
del dist_meta, dist_name, dist_version | ||
|
||
if self._collect_evidence and (evidence := self.__collect_license_evidence(dist)): | ||
component.evidence = evidence | ||
|
||
self.__component_add_extred_and_purl(component, packagesource4dist(dist)) | ||
all_components[normalize_packagename(component.name)] = ( | ||
component, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
files to consider:
LICENCE
/LICENSE
- obviouslyNOTICE
- some license texts, like Apache-2.0, explicitly state this exact file (name) as a optional source of custom license addendumCOPYING
- obviouslyread also: https://wheel.readthedocs.io/en/stable/user_guide.html?#including-license-files-in-the-generated-wheel-file
They describe this already, and they also allow arbitrary file suffixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nice catch! I think I got the original idea for the pattern from another license finder.
I'm thinking with
AUTHORS
andNOTICE
it's a bit less clear if some of the results should be considered license files in the sense that I'd expect in a CycloneDX SBOM, but perhaps it still makes sense. See e.g.https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE (maybe it counts.. not sure)
https://github.com/python-gitlab/python-gitlab/blob/main/AUTHORS
Suffixes should already be covered but I'll make sure to add some tests for this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regarding
NOTICE
file:such a file is a case of an addendum to standard license texts.
example:
the https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE counts, since it is anchored in the apache-2.0 license - see https://github.com/CycloneDX/cyclonedx-python/blob/main/LICENSE#L106-L121
the
AUTHORS
is unclear, this is true.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkowalleck thanks! Ok makes sense. WDYT should we skip
AUTHORS
or add it but make the pattern configurable? (this would make more sense if there was a tiny high-level public API for programmatic usage perhaps, I'll open a separate issue to discuss).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skip authors for now. if it is needed, somebody can add the feature later.
PS: Oftern i've seen
AUTHORS
being just a huge collection of email addresses and names with not much benefit. If it had any legal meaning, it would be where it belonged, instead of being an extra file. I am no lawyer and i did not read all license texts, maybe it is similar to theNOTICE
file ... IDK