Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(environment): support collecting license evidence from installed distributions #674

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 37 additions & 3 deletions cyclonedx_py/_internal/environment.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,10 @@
# Copyright (c) OWASP Foundation. All Rights Reserved.


import re
from argparse import OPTIONAL, ArgumentParser
from importlib.metadata import distributions
from base64 import b64encode
from importlib.metadata import Distribution, distributions
from json import loads
from os import getcwd, name as os_name
from os.path import exists, isdir, join
Expand All @@ -26,8 +28,9 @@
from textwrap import dedent
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Tuple

from cyclonedx.model import Property
from cyclonedx.model.component import Component, ComponentType
from cyclonedx.model import AttachedText, Encoding, Property
from cyclonedx.model.component import Component, ComponentEvidence, ComponentType
from cyclonedx.model.license import DisjunctiveLicense
from packageurl import PackageURL
from packaging.requirements import Requirement

Expand All @@ -47,6 +50,8 @@

T_AllComponents = Dict[str, Tuple['Component', Iterable[Requirement]]]

LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files to consider:

  • LICENCE/LICENSE - obviously
  • NOTICE - some license texts, like Apache-2.0, explicitly state this exact file (name) as a optional source of custom license addendum
  • COPYING - obviously

read also: https://wheel.readthedocs.io/en/stable/user_guide.html?#including-license-files-in-the-generated-wheel-file
They describe this already, and they also allow arbitrary file suffixes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice catch! I think I got the original idea for the pattern from another license finder.

I'm thinking with AUTHORS and NOTICE it's a bit less clear if some of the results should be considered license files in the sense that I'd expect in a CycloneDX SBOM, but perhaps it still makes sense. See e.g.
https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE (maybe it counts.. not sure)
https://github.com/python-gitlab/python-gitlab/blob/main/AUTHORS

Suffixes should already be covered but I'll make sure to add some tests for this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding NOTICE file:
such a file is a case of an addendum to standard license texts.
example:
the https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE counts, since it is anchored in the apache-2.0 license - see https://github.com/CycloneDX/cyclonedx-python/blob/main/LICENSE#L106-L121

the AUTHORS is unclear, this is true.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkowalleck thanks! Ok makes sense. WDYT should we skip AUTHORS or add it but make the pattern configurable? (this would make more sense if there was a tiny high-level public API for programmatic usage perhaps, I'll open a separate issue to discuss).

Copy link
Member

@jkowalleck jkowalleck Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip authors for now. if it is needed, somebody can add the feature later.

PS: Oftern i've seen AUTHORS being just a huge collection of email addresses and names with not much benefit. If it had any legal meaning, it would be where it belonged, instead of being an extra file. I am no lawyer and i did not read all license texts, maybe it is similar to the NOTICE file ... IDK



class EnvironmentBB(BomBuilder):

Expand Down Expand Up @@ -106,6 +111,11 @@ def make_argument_parser(**kwargs: Any) -> 'ArgumentParser':
# `--local` If in a virtualenv that has global access, do not list globally-installed packages.
# `--user` Only output packages installed in user-site.
# `--path <path>` Restrict to the specified installation path for listing packages
p.add_argument('--collect-evidence',
help='Whether to collect license evidence from components',
action='store_true',
dest='collect_evidence',
default=False)
p.add_argument('python',
metavar='<python>',
help='Python interpreter',
Expand All @@ -115,8 +125,10 @@ def make_argument_parser(**kwargs: Any) -> 'ArgumentParser':

def __init__(self, *,
logger: 'Logger',
collect_evidence: bool,
**__: Any) -> None:
self._logger = logger
self._collect_evidence = collect_evidence

def __call__(self, *, # type:ignore[override]
python: Optional[str],
Expand Down Expand Up @@ -144,6 +156,24 @@ def __call__(self, *, # type:ignore[override]
self.__add_components(bom, rc, path=path)
return bom

@staticmethod
def __collect_license_evidence(dist: Distribution) -> Optional[ComponentEvidence]:
if not (dist_files := dist.files):
return None

if not (license_files := [f for f in dist_files if LICENSE_FILE_REGEX.match(f.name)]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to guess the name of the file.
You might be able to access all names via dist.metadata.get_all('License-File')
this is similar to how declared component licenses are collected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkowalleck maybe the scope of this PR differs slightly from the issue description, but one of my main goals with this was to also collect licenses that the maintainers do not declare. For example, with dual licenses or in bundled and vendored code or where we can't trust maintainers to provide all the fields.

So when we build SBOMs internally at my employer I'd then get complaints that SBOMs are incomplete/inaccurate. This tries to address parts of that as well.

Take this example of the popular mkdocs-material theme (https://github.com/squidfunk/mkdocs-material):

import re
from importlib.metadata import distribution
dist = distribution("mkdocs-material")

LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
dist.metadata.get_all('License-File')
> ['LICENSE']
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('material/templates/.icons/fontawesome/LICENSE.txt'), PackagePath('material/templates/.icons/material/LICENSE'), PackagePath('material/templates/.icons/octicons/LICENSE'), PackagePath('material/templates/.icons/simple/LICENSE.md'), PackagePath('mkdocs_material-9.4.7.dist-info/licenses/LICENSE')]

Or just tried another one, https://pypi.org/project/packaging/ with a dual license:

dist = distribution("packaging")
dist.metadata.get_all('License-File')
> 
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('packaging-23.2.dist-info/LICENSE'), PackagePath('packaging-23.2.dist-info/LICENSE.APACHE'), PackagePath('packaging-23.2.dist-info/LICENSE.BSD')]

Maybe it's slightly different scopes but I think it's worth adding all those licenses, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see.
then maybe do your auto-detection, and add the declared license files, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks - so if I understand correctly we should try make a set out of the collected and maintainer-declared files right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might consider preventing duplicates in certain means, yes.
To what extent? I'd leave the answer to you, as you are probably a user of this feature. I am certain you will make the right decision.

return None

licenses = []
for license_file in license_files:
license_name = f'License detected in {license_file}'
encoded_content = b64encode(license_file.read_binary()).decode('ascii')
license_text = AttachedText(content=encoded_content, encoding=Encoding.BASE_64)
license = DisjunctiveLicense(name=license_name, text=license_text)
licenses.append(license)

return ComponentEvidence(licenses=licenses)

def __add_components(self, bom: 'Bom',
rc: Optional[Tuple['Component', Iterable['Requirement']]],
**kwargs: Any) -> None:
Expand All @@ -165,6 +195,10 @@ def __add_components(self, bom: 'Bom',
# path of dist-package on disc? naaa... a package may have multiple files/folders on disc
)
del dist_meta, dist_name, dist_version

if self._collect_evidence and (evidence := self.__collect_license_evidence(dist)):
component.evidence = evidence

self.__component_add_extred_and_purl(component, packagesource4dist(dist))
all_components[normalize_packagename(component.name)] = (
component,
Expand Down