Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(license): reorder logic of how python package licenses are acquired #6220

Merged
merged 4 commits into from
Mar 8, 2024

Conversation

dus7eh
Copy link
Contributor

@dus7eh dus7eh commented Feb 28, 2024

Description

Change precedence of fields when reading python package licenses
Kept "License-expression" with highest importance although pep describing it is in draft state.
Then, trove license classifiers are read and if not present "License" field is taken into account. Lastly without changes "License-File".

Related issues

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

@dus7eh dus7eh force-pushed the fix/python-pkg-lic-reorder branch from ee3cc93 to d990009 Compare February 28, 2024 09:50
@dus7eh dus7eh force-pushed the fix/python-pkg-lic-reorder branch from d990009 to d82a816 Compare February 28, 2024 13:28
Copy link
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @dus7eh
Thanks for your work!

I did some refactoring. Look here please. If you agree with this, please update the tests.
Also i left some comments.

@@ -20,8 +20,6 @@ func NewParser() types.Parser {
return &Parser{}
}

// Parse parses egg and wheel metadata.
// e.g. .egg-info/PKG-INFO and dist-info/METADATA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you have reason to remove this comment?

Comment on lines 94 to 96
{Name: "pyphen", Version: "0.14.0", License: "GNU General Public License v2 or later (GPLv2+)"},
{Name: "pyphen", Version: "0.14.0", License: "GNU Lesser General Public License v2 or later (LGPLv2+)"},
{Name: "pyphen", Version: "0.14.0", License: "Mozilla Public License 1.1 (MPL 1.1)"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your solution creates duplicate packages. It is not right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can reduce the length of file by removing lines that are unnecessary for tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@dus7eh
Copy link
Contributor Author

dus7eh commented Feb 29, 2024

Hello @dus7eh Thanks for your work!

I did some refactoring. Look here please. If you agree with this, please update the tests. Also i left some comments.

I've fixed most but now the licenses get incorretly split. However we might treat this as a different bug.

Context:
The thing is that python trove classifiers (https://pypi.org/classifiers/) define a number of licenses with or and and words in their name (e.g. "License :: OSI Approved :: Historical Permission Notice and Disclaimer (HPND)"). Looking at the output I can see that during some additional post processing such texts are split. Thus, for example for the pyphen package I get 5 license entries instead of 3 in the final report:

  • GNU General Public License v2
  • later (GPLv2+)
  • GNU Lesser General Public License v2
  • later (LGPLv2+)
  • Mozilla Public License 1.1 (MPL 1.1)

@DmitriyLewen
Copy link
Contributor

DmitriyLewen commented Mar 1, 2024

I assumed that such a situation was possible 😞
But we can't correct logic that violates other logic.
That's why we want to create new structure for licenses (a string is no longer sufficient for all our cases).
At this point we can try to update our license splitting logic.
We can also try using a different separator so that the license separation log works correctly.

@dus7eh
Copy link
Contributor Author

dus7eh commented Mar 4, 2024

I believe that before doing and/or splits we should make some verification against known license names.
I could go through SPDX licence list (https://spdx.org/licenses/) and/or OSI Approved (https://opensource.org/licenses) which are language agnostic to see if there's a match. What do you think about this?
Logical flow:

  • split with , delimeter which as I see is generated on the license parsing stage
  • match with SPDX/OSI approved licenses
  • if failed split on and and or

In the future, when the new structure is in place similar verification could be done for different languages at lower level. For example for python against trove license classifiers (https://pypi.org/classifiers/). Then such license could be flagged as "NonSplittable" type.

@DmitriyLewen
Copy link
Contributor

Logical flow:

This is good way. But i see one problem with p1:
There are licenses that include , (e.g. Apache License, Version 2.0).
We need to think about how we can handle them.

@dus7eh
Copy link
Contributor Author

dus7eh commented Mar 5, 2024

Apache License, Version 2.0

Well, so the solution here would be to join licenses on a different char or set of characters

@DmitriyLewen
Copy link
Contributor

@dus7eh I added some new logic to license normalize function (3d9076d)

@dus7eh dus7eh requested review from simar7 and nikpivkin as code owners March 6, 2024 09:53
@dus7eh dus7eh force-pushed the fix/python-pkg-lic-reorder branch from a848a9f to 3d9076d Compare March 6, 2024 09:55
@dus7eh
Copy link
Contributor Author

dus7eh commented Mar 6, 2024

@dus7eh I added some new logic to license normalize function (3d9076d)

This increases logical complexity but works well for now :) Let's merge it and I can apply the described approach as a separate PR if you're interested

@DmitriyLewen
Copy link
Contributor

Let's merge it and I can apply the described approach as a separate PR if you're interested

Do you want to move new logic for license normalization to another PR?

@dus7eh
Copy link
Contributor Author

dus7eh commented Mar 7, 2024

Let's merge it and I can apply the described approach as a separate PR if you're interested

Do you want to move new logic for license normalization to another PR?

No, I'm fine with that in this PR.

Copy link
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dus7eh Thanks for your work and investigation.

@knqyf263 take a look, when you have time, please

@knqyf263 knqyf263 added this pull request to the merge queue Mar 8, 2024
Merged via the queue into aquasecurity:main with commit 56cedc0 Mar 8, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants