fix(license): reorder logic of how python package licenses are acquired #6220

dus7eh · 2024-02-28T09:21:10Z

Description

Change the precedence of fields when reading Python package licenses.
"License-Expression" keeps the highest priority, although the PEP describing it is still in draft state.
Next, the trove license classifiers are read; if none are present, the "License" field is used. "License-File" remains the last fallback, unchanged.
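
A minimal sketch of the precedence described above, written for illustration only. The `metadata` type and the `pickLicense` function are hypothetical names, not the parser's actual code, and joining several classifier names with ", " is an assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// metadata holds only the fields relevant to license selection.
// Hypothetical type for illustration; not the parser's real structure.
type metadata struct {
	LicenseExpression string   // "License-Expression" field
	Classifiers       []string // every "Classifier" value
	License           string   // "License" field
	LicenseFile       string   // "License-File" field
}

// pickLicense applies the precedence from the description:
// License-Expression > trove license classifiers > License > License-File.
func pickLicense(m metadata) string {
	if m.LicenseExpression != "" {
		return m.LicenseExpression
	}
	var names []string
	for _, c := range m.Classifiers {
		if !strings.HasPrefix(c, "License :: ") {
			continue // not a license classifier
		}
		parts := strings.Split(c, " :: ")
		names = append(names, parts[len(parts)-1])
	}
	if len(names) > 0 {
		// Assumption: several classifiers are joined into one string.
		return strings.Join(names, ", ")
	}
	if m.License != "" {
		return m.License
	}
	return m.LicenseFile
}

func main() {
	m := metadata{
		Classifiers: []string{
			"Programming Language :: Python :: 3",
			"License :: OSI Approved :: MIT License",
		},
		License: "BSD",
	}
	fmt.Println(pickLicense(m)) // prints "MIT License"
}
```

Here the classifier wins over the "License" field because no "License-Expression" is set.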

Related issues

Refers to the discussion in issue #5204 (don't split licenses from License field from python packaging).

Checklist

I've read the guidelines for contributing to this repository.
I've followed the conventions in the PR title.
I've added tests that prove my fix is effective or that my feature works.
I've updated the documentation with the relevant information (if needed).
I've added usage information (if the PR introduces new options)
I've included a "before" and "after" example to the description (if the PR is a user interface change).

DmitriyLewen

Hello @dus7eh
Thanks for your work!

I did some refactoring. Look here please. If you agree with this, please update the tests.
I also left some comments.

DmitriyLewen · 2024-02-29T04:58:50Z

pkg/dependency/parser/python/packaging/parse.go

@@ -20,8 +20,6 @@ func NewParser() types.Parser {
 	return &Parser{}
 }

-// Parse parses egg and wheel metadata.
-// e.g. .egg-info/PKG-INFO and dist-info/METADATA


Did you have a reason to remove this comment?

DmitriyLewen · 2024-02-29T05:59:02Z

pkg/dependency/parser/python/packaging/parse_test.go

+				{Name: "pyphen", Version: "0.14.0", License: "GNU General Public License v2 or later (GPLv2+)"},
+				{Name: "pyphen", Version: "0.14.0", License: "GNU Lesser General Public License v2 or later (LGPLv2+)"},
+				{Name: "pyphen", Version: "0.14.0", License: "Mozilla Public License 1.1 (MPL 1.1)"},


Your solution creates duplicate packages, which is not right.

DmitriyLewen · 2024-02-29T06:00:42Z

pkg/dependency/parser/python/packaging/testdata/asyncssh-2.14.2.METADATA

I think you can reduce the length of the file by removing lines that are unnecessary for the tests.

DmitriyLewen · 2024-02-29T06:00:56Z

pkg/dependency/parser/python/packaging/testdata/pyphen-0.14.0.METADATA

dus7eh · 2024-02-29T11:42:13Z

Hello @dus7eh Thanks for your work!

I did some refactoring. Look here please. If you agree with this, please update the tests. Also i left some comments.

I've fixed most of them, but now the licenses get incorrectly split. However, we might treat this as a separate bug.

Context:
The thing is that Python trove classifiers (https://pypi.org/classifiers/) define a number of licenses with the words "or" and "and" in their names (e.g. "License :: OSI Approved :: Historical Permission Notice and Disclaimer (HPND)"). Looking at the output, I can see that such texts are split during some additional post-processing. Thus, for example, for the pyphen package I get 5 license entries instead of 3 in the final report:

GNU General Public License v2
later (GPLv2+)
GNU Lesser General Public License v2
later (LGPLv2+)
Mozilla Public License 1.1 (MPL 1.1)
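
The effect above can be reproduced with a small sketch. The regular expression and the `naiveSplit` function are assumptions made for illustration, not the project's actual splitting code; they only show how splitting on commas and on the words "or"/"and" turns the three pyphen classifier names into five entries:

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveSep is a hypothetical separator pattern: a comma, or the
// words "or"/"and" surrounded by whitespace.
var naiveSep = regexp.MustCompile(`,\s*|\s+(?:or|and)\s+`)

// naiveSplit splits a license string without checking whether
// "or"/"and" is part of a license name.
func naiveSplit(s string) []string {
	return naiveSep.Split(s, -1)
}

func main() {
	in := "GNU General Public License v2 or later (GPLv2+), " +
		"GNU Lesser General Public License v2 or later (LGPLv2+), " +
		"Mozilla Public License 1.1 (MPL 1.1)"
	for _, l := range naiveSplit(in) {
		fmt.Println(l)
	}
}
```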

DmitriyLewen · 2024-03-01T04:14:13Z

I assumed that such a situation was possible 😞
But we can't fix one piece of logic in a way that breaks another.
That's why we want to create a new structure for licenses (a string is no longer sufficient for all our cases).
At that point we can try to update our license splitting logic.
We can also try using a different separator so that the license splitting logic works correctly.

dus7eh · 2024-03-04T11:41:32Z

I believe that before doing and/or splits we should verify the text against known license names.
I could go through the SPDX license list (https://spdx.org/licenses/) and/or the OSI Approved list (https://opensource.org/licenses), which are language-agnostic, to see if there's a match. What do you think about this?
Logical flow:

1. split on the , delimiter, which as far as I can see is generated at the license parsing stage
2. match against SPDX/OSI approved licenses
3. if that fails, split on and and or

In the future, when the new structure is in place, similar verification could be done per language at a lower level, for example for Python against trove license classifiers (https://pypi.org/classifiers/). Such a license could then be flagged as a "NonSplittable" type.
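
The logical flow above could be sketched roughly as follows. The `known` set is a tiny hand-written stand-in for the SPDX/OSI lists, and `splitLicenses` is a hypothetical name; neither reflects the code that was eventually merged:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// known is a placeholder for a lookup built from the SPDX / OSI lists.
// The entries here are examples only.
var known = map[string]bool{
	"GNU General Public License v2 or later (GPLv2+)":   true,
	"Historical Permission Notice and Disclaimer (HPND)": true,
}

// andOr matches the words "and"/"or" surrounded by whitespace.
var andOr = regexp.MustCompile(`\s+(?:and|or)\s+`)

// splitLicenses follows the three proposed steps:
// split on ",", keep known names whole, otherwise split on and/or.
func splitLicenses(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if known[part] {
			out = append(out, part) // known name: do not split further
			continue
		}
		out = append(out, andOr.Split(part, -1)...)
	}
	return out
}

func main() {
	in := "GNU General Public License v2 or later (GPLv2+), MIT or Apache-2.0"
	for _, l := range splitLicenses(in) {
		fmt.Println(l)
	}
}
```

With this input the GPL name stays intact because it is found in the lookup, while "MIT or Apache-2.0" is still split into two entries.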

DmitriyLewen · 2024-03-05T03:49:15Z

Logical flow:

This is a good approach. But I see one problem with step 1:
There are licenses that include , (e.g. Apache License, Version 2.0).
We need to think about how we can handle them.

dus7eh · 2024-03-05T08:24:06Z

Apache License, Version 2.0

Well, so the solution here would be to join licenses on a different character or set of characters.
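
As a rough illustration of that idea, the sketch below joins and splits license names on a separator other than a comma. The separator value and both function names are hypothetical; any real choice would have to be a string that cannot occur inside a license name:

```go
package main

import (
	"fmt"
	"strings"
)

// licenseSep is a hypothetical separator, chosen only for this example.
const licenseSep = ";"

// joinLicenses stores several license names in a single string.
func joinLicenses(names []string) string {
	return strings.Join(names, licenseSep)
}

// splitJoined recovers the individual names from a joined string.
func splitJoined(s string) []string {
	return strings.Split(s, licenseSep)
}

func main() {
	joined := joinLicenses([]string{"Apache License, Version 2.0", "MIT License"})
	for _, l := range splitJoined(joined) {
		fmt.Println(l)
	}
}
```

Because the comma is no longer a delimiter, "Apache License, Version 2.0" survives the round trip as one entry.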


DmitriyLewen · 2024-03-06T07:03:09Z

@dus7eh I added some new logic to license normalize function (3d9076d)

dus7eh · 2024-03-06T12:21:11Z

@dus7eh I added some new logic to license normalize function (3d9076d)

This increases logical complexity but works well for now :) Let's merge it and I can apply the described approach as a separate PR if you're interested

DmitriyLewen · 2024-03-07T03:18:34Z

Let's merge it and I can apply the described approach as a separate PR if you're interested

Do you want to move new logic for license normalization to another PR?

dus7eh · 2024-03-07T07:38:41Z

Let's merge it and I can apply the described approach as a separate PR if you're interested

Do you want to move new logic for license normalization to another PR?

No, I'm fine with that in this PR.

DmitriyLewen

@dus7eh Thanks for your work and investigation.

@knqyf263 take a look, when you have time, please

dus7eh requested review from knqyf263 and DmitriyLewen as code owners February 28, 2024 09:21

dus7eh force-pushed the fix/python-pkg-lic-reorder branch from ee3cc93 to d990009 Compare February 28, 2024 09:50

fix(license): reorder logic of how python package licenses are acquired

d82a816

dus7eh force-pushed the fix/python-pkg-lic-reorder branch from d990009 to d82a816 Compare February 28, 2024 13:28

refactor: save licenses from "Classifier: License" as string

ad213bc

DmitriyLewen reviewed Feb 29, 2024


chore(license): apply review fixes

00900f5

refactor: add later and python licence exceptions into SplitLicen…

3d9076d

…ses logic

dus7eh requested review from simar7 and nikpivkin as code owners March 6, 2024 09:53

dus7eh force-pushed the fix/python-pkg-lic-reorder branch from a848a9f to 3d9076d Compare March 6, 2024 09:55

DmitriyLewen approved these changes Mar 7, 2024


knqyf263 added this pull request to the merge queue Mar 8, 2024

Merged via the queue into aquasecurity:main with commit 56cedc0 Mar 8, 2024
23 checks passed

aqua-bot mentioned this pull request May 29, 2024

release: v0.52.0 [main] #6809

Merged
