Improve debian license detection #2390 #2518

Merged: 26 commits merged on Jun 3, 2021

@pombredanne pombredanne commented May 7, 2021

This PR is to improve how we handle Debian license detection in copyright files for #2390

pombredanne and others added 6 commits April 8, 2021 12:59
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The current test for debian copyright files was wrong and misleading.
This corrects the problem by having proper values in plain expected
files and in detailed files.

There was also a problem of test name masking where both detailed and
non-detailed test methods had the same name and therefore were
not running correctly at all.

As a result all expected YAML files have been regenerated too.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
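
The name masking described in this commit is ordinary Python class behavior: a second `def` with the same name silently replaces the first, so only one of the two tests is ever collected. A minimal sketch, using hypothetical test names rather than the ones in the actual test suite:

```python
import unittest


class TestDebianCopyright(unittest.TestCase):
    # Hypothetical test meant to check the "plain" expected files.
    def test_copyright_file(self):
        self.assertTrue(True)

    # Same name again: this definition replaces the one above in the
    # class namespace, so the "plain" test is never collected or run.
    def test_copyright_file(self):
        self.assertTrue(True)


# Two tests were written, but the loader only finds one name.
names = unittest.TestLoader().getTestCaseNames(TestDebianCopyright)
print(names)
```

Giving each method a distinct name (for example a `_detailed` suffix, also hypothetical) is enough for both tests to be collected.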
This is the set of files found in a recent debian-unstable-slim Docker
image. The expectations have been regenerated as-is but not yet
reviewed.

See also:
- aboutcode-org/scancode.io#128
- aboutcode-org/scancode.io#103

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Refactor debian copyright detection to add a DebianCopyrightDetector class,
and make changes to facilitate better copyright file parsing.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Fix bug in unstructured copyright file parsing, which always treated
copyright files as structured, and regenerate test files.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Remove `unique` and `simplify_licenses` to have non-unique and
non-simplified copyright and license information. Use with_debian_packaging
instead of using with_details and skip_debian_packaging.
Regenerate tests to update expectations.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@pombredanne

The overall approach to properly detect licenses in copyright files is this:

  • files that are not machine-readable get a specific process: they are treated mostly as plain text, with some heuristics to break them into chunks as needed

  • machine-readable files are parsed into paragraphs and processed this way:

  1. handle license paragraphs first:
    a. detect the license in the text/body and/or notes/comments
    b. parse the License: tag and determine whether this is a common license or not
    c. compare, reconcile and validate a. and b. for consistency and, where needed, create and store a new license "symbol" reference from b. using the license detected in a.

  2. handle the files paragraphs and the header paragraph second, using the list of new license symbols created in 1.:
    a. detect the license in the text/body and/or notes/comments
    b. parse the License: tag as a license expression using the common license symbols and the symbols found in 1.
    c. compare, reconcile and validate a. and b. for consistency and note any issues

  3. report results as a list of license matches
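
The two-pass flow for machine-readable files can be sketched roughly as follows. This is only an illustration of the ordering, not the code in this PR: `Paragraph`, `Match`, `detect`, `process` and the two lookup tables are hypothetical stand-ins, the detector is a toy keyword lookup, and the License: tag is split naively instead of being parsed as a real license expression.

```python
from dataclasses import dataclass

# Toy stand-ins for the real license index (assumptions for illustration).
COMMON_LICENSES = {"GPL-2+": "gpl-2.0-plus", "Expat": "mit"}
KEYWORDS = {"Permission is hereby granted": "mit", "BSD": "bsd-new"}


@dataclass
class Paragraph:
    kind: str         # "header", "files" or "license"
    license_tag: str  # first line of the License: field
    text: str = ""    # license body and/or comments


@dataclass
class Match:
    kind: str
    license_tag: str
    keys: list
    consistent: bool


def detect(text):
    """Toy detector: license keys whose keyword occurs in the text."""
    return [key for keyword, key in KEYWORDS.items() if keyword in text]


def process(paragraphs):
    symbols = dict(COMMON_LICENSES)
    matches = []

    # Step 1: license paragraphs first; they may define new symbols.
    for p in (p for p in paragraphs if p.kind == "license"):
        detected = detect(p.text)                      # 1.a
        expected = COMMON_LICENSES.get(p.license_tag)  # 1.b
        if expected is None:
            consistent = bool(detected)
            if detected:                               # 1.c: new symbol
                symbols[p.license_tag] = detected[0]
        else:
            consistent = not detected or expected in detected
        matches.append(Match(p.kind, p.license_tag, detected, consistent))

    # Step 2: files and header paragraphs, using the symbols from step 1.
    for p in (p for p in paragraphs if p.kind != "license"):
        detected = detect(p.text)                      # 2.a
        names = [t for t in p.license_tag.split()      # 2.b (naive split)
                 if t.lower() not in ("and", "or")]
        keys = [symbols.get(name, "unknown") for name in names]
        consistent = all(key in keys for key in detected)  # 2.c
        matches.append(Match(p.kind, p.license_tag, keys, consistent))

    return matches  # Step 3: a flat list of matches


paragraphs = [
    Paragraph("header", "Expat"),
    Paragraph("files", "Expat and my-bsd"),
    Paragraph("license", "my-bsd", "Redistribution ... BSD ..."),
    Paragraph("license", "Expat", "Permission is hereby granted, free of charge"),
]
results = process(paragraphs)
```

Processing the license paragraphs first is what lets the files paragraph resolve `my-bsd`, a symbol that only exists because a license paragraph defined it.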

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
This commit adds new functions for parsing structured Debian copyright
files for license and copyright detection.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds license detection from comments and other paragraphs, regenerates
test files.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds with_details flag to filter license detections, by not reporting licenses
that are the same as the header/primary license and by only reporting unique
license references in a file paragraph. Fix bugs related to identifying
debian/primary_license paragraphs. Regenerates test expectations.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from a0cf65e to 2a503d9 Compare May 24, 2021 03:38
Modify EnhancedDebianCopyright to be a DebianCopyright wrapper function
and modify flags used for filtering and reporting. Separate structured
and unstructured parsing into different classes having the same base class
and main methods.
Also modify the file to follow Black formatting standards.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Updates get_installed_packages to call the parse_copyright_file function
directly, get an object that depends on whether the copyright file is
structured or unstructured, and then call its functions with filtering
flags to get detections as required.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
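
The shape described in the two commits above (two parser classes behind one base class, picked by a single entry point) might look roughly like this. It is a sketch under stated assumptions: all class and function names here are hypothetical and are not the PR's actual API, and checking for a leading `Format:` field is a simplified test for a machine-readable file.

```python
class BaseCopyright:
    """Shared interface for both kinds of copyright file (hypothetical)."""

    def __init__(self, text):
        self.text = text

    def get_license_tags(self):
        raise NotImplementedError


class StructuredCopyright(BaseCopyright):
    def get_license_tags(self):
        # Collect the first line of every "License:" field.
        tags = []
        for line in self.text.splitlines():
            if line.lower().startswith("license:"):
                tags.append(line.split(":", 1)[1].strip())
        return tags


class UnstructuredCopyright(BaseCopyright):
    def get_license_tags(self):
        # No fields to rely on: the text would go to plain-text detection.
        return []


def parse_copyright_text(text):
    """Return the parser object matching the kind of copyright file."""
    # Simplifying assumption: a machine-readable file opens with "Format:".
    first_line = text.lstrip().splitlines()[0] if text.strip() else ""
    if first_line.lower().startswith("format:"):
        return StructuredCopyright(text)
    return UnstructuredCopyright(text)


structured = parse_copyright_text(
    "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n"
    "\n"
    "Files: *\n"
    "License: GPL-2+\n"
)
plain = parse_copyright_text("This package was debianized by hand.\n")
```

Callers only see the base-class methods, so the same filtering flags can be passed whichever object comes back.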
Add tests for EnhancedDebianCopyright class and also modify test functions
to adopt the new API.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
This makes declared_license also report the declared license found in the
license paragraphs of debian copyright files. Updates test expectations.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Modify get_copyrights to have unique copyrights when the
unique_copyrights flag is set to True.

Refer to #2390

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
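
The effect of the `unique_copyrights` flag can be illustrated with an order-preserving de-duplication. This is a sketch only: `collect_copyrights` is a hypothetical standalone function, not the `get_copyrights` method changed in this commit, and the real method may de-duplicate differently.

```python
def collect_copyrights(statements, unique_copyrights=False):
    """Return copyright statements, optionally without repeats."""
    if unique_copyrights:
        # dict.fromkeys keeps the first occurrence and preserves order.
        return list(dict.fromkeys(statements))
    return list(statements)


statements = [
    "Copyright 2019 Jane Doe",
    "Copyright 2020 Example Corp",
    "Copyright 2019 Jane Doe",
]
everything = collect_copyrights(statements)
deduplicated = collect_copyrights(statements, unique_copyrights=True)
```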
Regenerate test expectations after upgrading to latest debian-inspector
to parse paragraphs after double empty lines correctly, as the latest
version fixes this issue.

Refer to #2390
Refer to aboutcode-org/debian-inspector#17

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from 4684c98 to 3eb1808 Compare May 28, 2021 10:52
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Instead of adding a general `unknown_debian_license` rule, create
a synthetic UnknownRule object and a LicenseMatch object out of the
unknown license text. Also updates test expectations after reindexing
licenses with new rules added from develop branch.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
For debian packages which have the same copyright, delete one
from tests.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Update requirements and setup.cfg files to install the latest
debian-inspector version 21.5.25 to fix the following issue:
aboutcode-org/debian-inspector#17

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from 0b034af to 2e26e25 Compare June 2, 2021 09:35
@AyanSinhaMahapatra

All green!

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
AyanSinhaMahapatra and others added 4 commits June 2, 2021 22:56
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne

All green at last! merging now.
@AyanSinhaMahapatra thanks!

@pombredanne pombredanne merged commit e34bc1e into develop Jun 3, 2021
@pombredanne pombredanne deleted the 2390-improve-debian-license-detection branch June 3, 2021 20:53