Update license detection #2505

pombredanne · 2021-04-22T22:19:12Z

This is an omnibus PR for the April 2021 license detection updates.

Tasks

Reviewed contribution guidelines
PR is descriptively titled 📑 and links the original issue above 🔗
Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
Run tests locally to check for errors.
Commits are in a uniquely-named feature branch and have no merge conflicts 📁


In refine_matches(), call filter_if_only_known_words_rule() later in the process to ensure that small contained rules are not left at the end. Also format code. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Calling tokens_by_line() with location and query_string arguments makes the code clearer and easier to read. Also apply minor formatting. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Run tokens_by_line() first without a stopwords argument. This allows stopword tokens to be included in the token stream and later to be treated as "unknown" tokens. This way the presence of stopwords in a match can affect the license match score, and a license rule with `only_known_words: yes` is rejected not only when there are unknown words but also when stopwords are mixed in with its rule words. Also add a simple end-to-end integration test for this. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
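To illustrate the idea in the commit message above, here is a minimal sketch of a tokenizer that keeps stopwords in the token stream and then labels them as unknown. The stopword set, the vocabulary, and the function names `tokens_keeping_stopwords` and `classify` are hypothetical and chosen for illustration; this is not the actual licensedcode tokenizer.

```python
import re

# Hypothetical stopwords and rule vocabulary, for illustration only.
STOPWORDS = frozenset({"a", "href", "div", "span"})
KNOWN_WORDS = frozenset({"gpl", "3", "0", "license", "under", "the"})

# Split on anything that is not a lowercase letter or a digit.
word_splitter = re.compile(r"[a-z0-9]+").findall


def tokens_keeping_stopwords(text):
    """Return lowercased word tokens with stopwords left in the stream."""
    return word_splitter(text.lower())


def classify(tokens):
    """
    Label each token as "known" or "unknown".
    Stopwords and out-of-vocabulary words are both labeled "unknown".
    """
    return [
        (tok, "known" if tok in KNOWN_WORDS and tok not in STOPWORDS else "unknown")
        for tok in tokens
    ]


labeled = classify(tokens_keeping_stopwords("GPL <a href> 3.0"))
```

Because the stopwords stay in the stream, a later filtering step can see that markup tokens sit between the words of a short rule.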

- Add new LicenseMatch.qcontains_stopwords() to check if stopwords are included in the matched range, and use this function in the filter_if_only_known_words_rule() filter.
- Simplify handling of stopwords in matched text processing.
- Use separate index and query tokenizer functions.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
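The filtering described above can be sketched roughly as follows. The `Match` class, its fields, and `filter_only_known_words` are simplified stand-ins invented for this example; the real `LicenseMatch` and `filter_if_only_known_words_rule()` in licensedcode differ in their data structures and signatures.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical, simplified stand-in for a license match.
    rule_id: str
    only_known_words: bool
    start: int  # first query position covered by the match
    end: int    # last query position covered by the match (inclusive)

    def contains_any(self, positions):
        """Return True if any of the positions falls inside the matched range."""
        return any(self.start <= p <= self.end for p in positions)


def filter_only_known_words(matches, unknown_positions, stopword_positions):
    """
    Split matches into (kept, discarded). A match is discarded when its rule
    requires only known words and its range contains an unknown word or a stopword.
    """
    kept, discarded = [], []
    for m in matches:
        spurious = m.only_known_words and (
            m.contains_any(unknown_positions) or m.contains_any(stopword_positions)
        )
        (discarded if spurious else kept).append(m)
    return kept, discarded


matches = [
    Match("gpl_bare_words", True, 0, 4),
    Match("gpl_notice", False, 0, 4),
    Match("gpl_bare_words_2", True, 10, 12),
]
kept, discarded = filter_only_known_words(
    matches, unknown_positions={20}, stopword_positions={2}
)
```

In this sketch only the first match is discarded: it requires known words only, and a stopword sits at position 2 inside its range.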

pombredanne · 2021-04-23T06:41:35Z

Merging this now. There are many new rules, including several generated false positive rules for lists of SPDX license ids that are often found in the code and test files of license-related packages.
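As a rough illustration of how such false positive rule texts could be generated from sequences of SPDX ids, here is a minimal n-gram sketch. The helper `id_ngrams` and the short id list are assumptions made for this example, not the script used in this PR.

```python
def id_ngrams(ids, n):
    """Return every run of n consecutive ids as one space-joined string."""
    return [" ".join(ids[i:i + n]) for i in range(len(ids) - n + 1)]


# A short, illustrative excerpt of SPDX license ids in list order.
spdx_ids = ["AFL-3.0", "AGPL-3.0-only", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"]

# Each sequence could become the text of one candidate false positive rule.
sequences = id_ngrams(spdx_ids, 3)
```

Each generated sequence resembles the kind of bare id listing found in license handling tools, which is not a license statement in itself.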

sthagen · 2021-04-25T13:12:34Z

src/licensedcode/data/rules/gpl-3.0_gpl_30_bare_words.yml

@@ -1,3 +1,5 @@
 license_expression: gpl-3.0
 is_license_tag: yes
-relevance: 100
+relevance: 60
+only_known_words: yes


pombredanne added 30 commits April 14, 2021 15:54

Add new SSPL detection rule

1a16236

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new and improved license detection rules #2404

adff0dc

Reported-by: Till Jaeger @LeChasseur Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update CHANGELOG

075d4d8

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection rule

9114f6f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add minimum coverage

bbba3fa

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Enable using synclib as a library

b3609bc

In particular some functions need all scancode licenses but not always. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Generate new FP rules from SPDX id sequences

d8a31be

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Correctly test npm for unknown licenses

bb40033

And add tests for npm.compute_normalized_license Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new false positive rules for SPDX ids

5f39252

These are generated sequences of false positive licenses typically found in license lists and license handling tools. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Streamline debug tracing printouts

3133a13

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Do not remove overlapping false positive matches

3a9f71b

- Instead, keep them until later stages of license match refinement; otherwise some false positives may be ignored.
- Also streamline debug tracing printouts.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new generated license false positive rules

2b7fb90

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new misc. license detection rules

081bf90

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new and improved license detection rules

9c94995

And update corresponding tests. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Split license validation tests in five suites

841a5e9

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Make GPL rules less false-positive prone #2484

f13957e

Add the "only_known_words: yes" flag to these short GPL rules, which otherwise too often yield spurious false positive detections. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add tests for false positive GPL detections #2484

4823b67

Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Remove redundant build_licenses_db function

6e87760

Remove function licensedcode.cache.build_licenses_db as this is only a thin wrapper on licensedcode.models.load_licenses. Also remove lower() in license URLs check. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Format code for readability

82e0c10

Also remove unused MAX_DIST variable reference in refine_matches() Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Allow query_tokenizer() call without stopwords #2484

eaa82f8

We need this so that we can later treat stopwords as if they were unknown words. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Rename variable names for clarity

0899c5f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new or update KDE "Accepted" L/GPL licenses

26c048b

Use KDE's own license refs. Make these regular licenses, not exceptions. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new and improved license detection rules

5e72f06

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new WPD license derived from the OGL license

6ed2456

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Remove other licenses from exception text

505e468

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add bsla variant without advertising clause

306e4cc

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve rules relevance

0ff7ac7

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added 24 commits April 17, 2021 17:26

Generate FP license rules not only from ngrams

e4fc12d

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve copyright detection

ca849c2

Weed out some false positive detections Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update tests with latest expectations

7c6bcd9

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Ignore local tmp directories in tests

5b99659

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection rules

2f0a51f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new false positive license detection rules

e26ab80

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Improve rule relevance and coverage

10b927c

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update test expectation to match latest rules

72aba1e

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Track stopwords in license queries and matches

94ee3ae

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Requalify some bsd-new and bsd-simplified rules

3d6e021

Correct qualification of bsd-new and bsd-simplified rules that were incorrectly qualified as one or the other. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Correct is_license_text flags

6bb69a3

These are notices, not texts. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Treat third-party SPDX LicenseRef consistently

3b63933

For spdx_license_key, we use LicenseRef-scancode-*, and we track other LicenseRef in other_spdx_license_keys. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new rules and improve existing license rules

8c2d12e

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Fix YAML syntax

6014264

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Bump relevance for SPDX id

a5610a1

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Rename test method for clarity

25e12aa

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Use separate index and query tokenizer functions

6b04eb4

Having a single function is nice but means there is an overhead on each call. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Update query_tokenizer tests

db8a535

Also add new index_tokenizer tests Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Correctly track positions with stopwords present

265e7d3

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Use query_string argument where needed

28f29c9

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add kde-accepted licenses to rules

ce990bc

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Align test expectations with latest rules set

2a2efc7

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Add new license detection rules

4a89a9b

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne merged commit bb04420 into develop Apr 23, 2021

pombredanne deleted the 2021-04-license-updates branch April 23, 2021 06:41

sthagen reviewed Apr 25, 2021


This was referenced Aug 14, 2021

False positive: GPL instead of LGPL #2641

Closed

Improve false positive license detection for license lists #2651

Open


pombredanne Apr 26, 2021
