Improve debian license detection #2390 #2558

Merged
merged 34 commits into from
Jul 2, 2021

@pombredanne pombredanne commented Jun 15, 2021

Improves Debian license detection with the following enhancements:

  1. Adds a license_matches property to get all matches detected in a structured file.
  2. Better filtering of license intros in unstructured debian copyright files
  3. Adds license rules from copyright files for debian-slim-buster to improve license detection
  4. Adds tests to verify 100% license detection accuracy on debian-slim copyright files (no unknown-license-reference)
  5. Adds a consistency check option for debian copyright files
  6. Detects structured copyright files more accurately
  7. Adds checks that substitution is carried out properly
  8. Removes none and unknown copyrights
  9. Adds license detection from extra_data
  10. Deletes old functions and debian_licenses.txt
  11. Adds a fix for Failure: "Bad arguments: all arguments must be an Expression" scancode.io#219

See:

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 📁

Uses the filter_licenses flag to return all license matches, not
only the unique ones, in the tests for unstructured copyright files,
which set with_details to True.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Add functions to check the consistency of a debian copyright file. If
enabled, the check raises an exception in the following cases:

- Unstructured File
- Has other paragraphs detected
- Has duplicate license paragraphs
- Has paragraphs with a license but no license name
- Not all licenses in license paragraphs are used
- License expressions are not parsable

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
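
The consistency checks listed in the commit message above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Paragraph` class, `expression_keys`, `consistency_errors` and `check_consistency` are hypothetical names, the tokenizer is naive (it does not understand "with" exceptions), and the "expressions are not parsable" check is left out because it depends on a real expression parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Paragraph:
    """Hypothetical, simplified stand-in for one parsed paragraph."""
    kind: str  # "files", "license" or "other"
    license_name: Optional[str] = None  # name or expression on the License: line
    license_text: Optional[str] = None


def expression_keys(expression: str) -> Set[str]:
    """Split a license expression into its license names (naive tokenizer)."""
    tokens = re.findall(r"[^\s(),]+", expression)
    return {t for t in tokens if t.lower() not in {"and", "or"}}


def consistency_errors(is_structured: bool, paragraphs: List[Paragraph]) -> List[str]:
    """Return one message per failed check; an empty list means consistent."""
    errors = []
    if not is_structured:
        errors.append("unstructured file")
    if any(p.kind == "other" for p in paragraphs):
        errors.append("has other paragraphs")

    license_paragraphs = [p for p in paragraphs if p.kind == "license"]
    names = [p.license_name for p in license_paragraphs if p.license_name]
    if len(names) != len(set(names)):
        errors.append("duplicate license paragraphs")
    if any(p.license_text and not p.license_name for p in license_paragraphs):
        errors.append("license paragraph without a license name")

    # Every named license paragraph should be referenced by a files paragraph.
    used: Set[str] = set()
    for p in paragraphs:
        if p.kind == "files" and p.license_name:
            used |= expression_keys(p.license_name)
    unused = sorted(set(names) - used)
    if unused:
        errors.append("unused license paragraphs: " + ", ".join(unused))
    return errors


def check_consistency(is_structured: bool, paragraphs: List[Paragraph]) -> None:
    """Raise if any check fails, mirroring the opt-in behavior described above."""
    errors = consistency_errors(is_structured, paragraphs)
    if errors:
        raise ValueError("; ".join(errors))
```

Collecting all messages before raising keeps the report complete when a file fails several checks at once.
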
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch 2 times, most recently from 942f592 to b81cb96 Compare June 17, 2021 07:29
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from b81cb96 to cb074fc Compare June 23, 2021 11:29
Adds a filter for unstructured debian copyright files: license intros
that are perfectly matched are discarded, because in the context of a
debian copyright file the license texts/notices are also present,
not just the intro.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
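
As an illustration of the intro filter described in the commit message above, here is a minimal sketch. The `Match` class and its `is_license_intro` and `coverage` fields are simplified, hypothetical stand-ins, not the actual match objects used in this PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    """Hypothetical, simplified license match."""
    license_key: str
    is_license_intro: bool
    coverage: float  # percentage of the rule that matched, 0 to 100


def filter_perfect_intros(matches: List[Match]) -> List[Match]:
    """Drop license intro matches that matched their rule completely.

    Partially matched intros are kept, since they may hide a real detection issue.
    """
    return [
        m for m in matches
        if not (m.is_license_intro and m.coverage == 100)
    ]
```
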
Adds tests which fail if there is an unknown license
detection or a license detection issue with low match
coverage present in the test cases. Also traces the
detections in case of failures. Fixes some test expectations.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Add new rules and modify existing rules to get debian-slim license
detections correct.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Modify test expectations after license detection improvements in debian-slim
copyright files.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Improves license detection by modifying unstructured license
intro detection with full coverage. Fixes matched_text bug in unknown
debian license by setting the lines. Removes unknown copyright detections.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds license_matches property to get the LicenseMatch objects
out of LicenseDetection objects directly as a property.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Structured debian copyright files are detected by the 'format: '
first line. This adds more format links commonly encountered in
debian copyright files, so structured copyright files are detected
more reliably.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
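
A rough sketch of first-line detection as described in the commit message above. The function name and the marker list are assumptions for illustration only; the actual set of format links added in this PR is not shown here.

```python
# Illustrative fragments of format links; the real list may differ.
KNOWN_FORMAT_MARKERS = (
    "debian.org/doc/packaging-manuals/copyright-format/1.0",
    "dep.debian.net/deps/dep5",
)


def is_structured_copyright(text: str) -> bool:
    """Return True if the first non-empty line is a known 'Format:' field."""
    for line in text.splitlines():
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "format":
            return False
        value = value.strip().lower()
        return any(marker in value for marker in KNOWN_FORMAT_MARKERS)
    return False
```

Matching on link fragments rather than exact URLs tolerates http/https and trailing-slash variants.
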
Adds rules to remove unknown-license-references in common debian
copyright files. Regenerates test files with removed unknowns.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Use the licensing.dedup function from license-expression library
to simplify the licenses without losing any license-expression
specific information.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
For structured debian copyright files, return an attribute
with the primary license detected, and for unstructured files
return None.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds the debian copyright file which caused an exception, as
it says it's a structured debian copyright file, but doesn't
have structured paragraphs in it.

See - aboutcode-org/scancode.io#219

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from 8b050bc to 95568bc Compare June 28, 2021 16:28
Fixes bug in get_license_expression by not adding a LicenseDetection
object when there are no license matches in an other paragraph,
in a structured debian copyright file.

See - aboutcode-org/scancode.io#219

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Modify get_license_expression to not pass None values to
combine_expression. Also handle the case where all license detections
are None by getting an expression from the license_matches,
or raising an Error if there are no license_matches.

See - aboutcode-org/scancode.io#219

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
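
The None handling described in the commit message above might look roughly like this. It is a sketch using plain strings: `build_expression` is a hypothetical name, and the string join stands in for the real combine_expression call.

```python
from typing import Iterable, Optional


def build_expression(
    detected: Iterable[Optional[str]],
    from_matches: Iterable[Optional[str]] = (),
) -> str:
    """Combine expressions with AND, skipping None values.

    Falls back to expressions taken from the matches when every detection
    is None, and raises when nothing at all is available.
    """
    present = [e for e in detected if e]
    if not present:
        present = [e for e in from_matches if e]
    if not present:
        raise ValueError("no license expression could be built")

    unique = list(dict.fromkeys(present))  # drop duplicates, keep order
    if len(unique) == 1:
        return unique[0]
    return " AND ".join(f"({e})" for e in unique)
```
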
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from 95568bc to ffdec4e Compare June 28, 2021 16:40
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Remove filter_licenses flag from get_license_expression functions
as licensing.dedup() makes this function redundant. Also renames
filter_licenses to filter_duplicates.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Remove copyrights which start with "none", like
`Copyright: none`. Regenerate tests to remove nones
from results.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
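
A minimal sketch of the "none" filter described in the commit message above. The function name and the regular expression are assumptions; the sketch treats "none" as a whole word so that a holder name that merely begins with those letters is kept.

```python
import re
from typing import List

# Optional "Copyright:" label, then the word "none" at the start of the value.
NONE_COPYRIGHT = re.compile(r"^\s*(copyright\s*:)?\s*none\b", re.IGNORECASE)


def drop_none_copyrights(copyrights: List[str]) -> List[str]:
    """Remove copyright statements whose value starts with 'none'."""
    return [c for c in copyrights if not NONE_COPYRIGHT.match(c)]
```
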
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Move the consistency error function into the EnhancedDebianCopyright class
and have the error texts added directly instead of through a dict lookup.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Rename the rules appropriately, with some modifications and additions.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds a function to check whether the keys in debian expression can be substituted
successfully, and adds an UnknownMatch if there are inconsistencies.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the 2390-improve-debian-license-detection branch from 4624992 to 467d23d Compare July 2, 2021 06:53
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Adds a function to do license detection on the license name
if it is not in the seen keys. Regenerates test and adds rules.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Removes old debian copyright parsing functions and also
removes debian_licenses.txt.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
@pombredanne Tests are all green now.

@pombredanne pombredanne left a comment

LGTM! merging ... thank you ++ that's a big one.
Separately we will need a CHANGELOG entry 👍

@pombredanne pombredanne merged commit 3f7da81 into develop Jul 2, 2021
@pombredanne pombredanne deleted the 2390-improve-debian-license-detection branch July 2, 2021 16:59