Update license detection #2505

Merged
merged 57 commits into from
Apr 23, 2021
Changes from 16 commits
1a16236
Add new SSPL detection rule
pombredanne Apr 1, 2021
adff0dc
Add new and improved license detection rules #2404
pombredanne Apr 1, 2021
075d4d8
Update CHANGELOG
pombredanne Apr 1, 2021
9114f6f
Add new license detection rule
pombredanne Apr 2, 2021
bbba3fa
Add minimum coverage
pombredanne Apr 10, 2021
b3609bc
Enable using synclib as a library
pombredanne Apr 10, 2021
d8a31be
Generate new FP rules from SPDX id sequences
pombredanne Apr 10, 2021
bb40033
Correctly npm test for unknown licenses
pombredanne Apr 10, 2021
5f39252
Add new false positive rules for SPDX ids
pombredanne Apr 10, 2021
3133a13
Streamline debug tracing printouts
pombredanne Apr 10, 2021
3a9f71b
Do not remove overlapping false positive matches
pombredanne Apr 10, 2021
2b7fb90
Add new generated license false positive rules
pombredanne Apr 10, 2021
081bf90
Add new misc. license detection rules
pombredanne Apr 10, 2021
9c94995
Add new and improved license detection rules
pombredanne Apr 10, 2021
841a5e9
Split license validation tests in five suites
pombredanne Apr 11, 2021
f13957e
Make GPL rules less false-positive prone #2484
pombredanne Apr 12, 2021
4823b67
Add tests for false positive GPL detections #2484
pombredanne Apr 12, 2021
6e87760
Remove redundant build_licenses_db function
pombredanne Apr 12, 2021
82e0c10
Format code for readability
pombredanne Apr 12, 2021
6492998
Run filter_if_only_known_words_rule() last #2484
pombredanne Apr 12, 2021
eaa82f8
Allow query_tokenizer() call without stopwords #2484
pombredanne Apr 12, 2021
06fc09b
Call tokens_by_line() with arguments #2484
pombredanne Apr 12, 2021
2038f28
Treat stopwords the same as unknown words #2484
pombredanne Apr 12, 2021
0899c5f
Rename variable names for clarity
pombredanne Apr 12, 2021
26c048b
Add new or update KDE "Accepted" L/GPL licenses
pombredanne Apr 12, 2021
5e72f06
Add new and improved license detection rules
pombredanne Apr 12, 2021
6ed2456
Add new WPD license derived from the OGL license
pombredanne Apr 14, 2021
505e468
Remove other licenses from exception text
pombredanne Apr 14, 2021
306e4cc
Add bsla variant without advertising clause
pombredanne Apr 14, 2021
0ff7ac7
Improve rules relevance
pombredanne Apr 14, 2021
46b4092
Add new license detection rules
pombredanne Apr 14, 2021
49f32e6
Add new license rule for entsoe notice
pombredanne Apr 14, 2021
fb606a7
Merge remote-tracking branch 'upstream/develop' into 2021-04-license-…
pombredanne Apr 16, 2021
e4fc12d
Generate FP license rules not only from ngrams
pombredanne Apr 17, 2021
ca849c2
Improve copyright detection
pombredanne Apr 17, 2021
7c6bcd9
Update tests with latest expectations
pombredanne Apr 17, 2021
5b99659
Ignore local tmp directories in tests
pombredanne Apr 17, 2021
2f0a51f
Add new license detection rules
pombredanne Apr 17, 2021
e26ab80
Add new false positive license detection rules
pombredanne Apr 17, 2021
10b927c
Improve rule relevance and coverage
pombredanne Apr 17, 2021
72aba1e
Update test expectation to match latest rules
pombredanne Apr 17, 2021
94ee3ae
Track stopwords in license queries and matches
pombredanne Apr 17, 2021
3d6e021
Requalify some bsd-new and bsd-simplified rules
pombredanne Apr 18, 2021
6bb69a3
Correct is_license_text flags
pombredanne Apr 18, 2021
3b63933
Treat consistently third-party SPDX licenseref
pombredanne Apr 18, 2021
8c2d12e
Add new rules and improve existing license rules
pombredanne Apr 18, 2021
6014264
Fix YAML syntax
pombredanne Apr 18, 2021
a5610a1
Bump relevance for SPDX id
pombredanne Apr 19, 2021
25e12aa
Rename test method for clarity
pombredanne Apr 19, 2021
6b04eb4
Use separate index and query tokenizer functions
pombredanne Apr 19, 2021
284f4c5
Improve filter_if_only_known_words_rule()
pombredanne Apr 19, 2021
db8a535
Update query_tokenizer tests
pombredanne Apr 19, 2021
265e7d3
Correctly track positions with stopwords present
pombredanne Apr 19, 2021
28f29c9
Use query_string argument where needed
pombredanne Apr 19, 2021
ce990bc
Add kde-accepted licenses to rules
pombredanne Apr 19, 2021
2a2efc7
Align test expectations with latest rules set
pombredanne Apr 19, 2021
4a89a9b
Add new license detection rules
pombredanne Apr 19, 2021
The diff you're trying to view is too large. We only load the first 3000 changed files.
7 changes: 7 additions & 0 deletions CHANGELOG.rst
@@ -37,6 +37,10 @@ License scanning:
 - Add new command line option to filter ignorable copyrights when included
   in licenses.

+- Add new and improved license detection rules.
+  Thank you to:
+  - Sebastian Thomas @sebathomas
+  - Till Jaeger @LeChasseur


 v21.3.31
@@ -93,6 +97,9 @@ Copyright scanning:
 - Allow calling copyright detection from text lines to ease integration
   Thank you to Jelmer Vernooij @jelmer

+- Fixed copyright truncation bug
+  Thank you to Akanksha Garg @akugarg
+

 Package scanning:
 ~~~~~~~~~~~~~~~~~
26 changes: 23 additions & 3 deletions azure-pipelines.yml
@@ -61,15 +61,35 @@ jobs:
         bin/pytest -n 3 -vvs --test-suite=all \
           tests/licensedcode/test_detection_datadriven_external.py

-    license_validate1: |
+    license_validate_basic: |
         bin/pytest -n 3 -vvs --test-suite=validate \
           tests/licensedcode/test_detection_validate.py \
           -k TestValidateLicenseBasic

-    license_validate2: |
+    license_validate_extended_1: |
         bin/pytest -n 3 -vvs --test-suite=validate \
           tests/licensedcode/test_detection_validate.py \
-          -k TestValidateLicenseExtended
+          -k TestValidateLicenseExtended1
+
+    license_validate_extended_2: |
+        bin/pytest -n 3 -vvs --test-suite=validate \
+          tests/licensedcode/test_detection_validate.py \
+          -k TestValidateLicenseExtended2
+
+    license_validate_extended_3: |
+        bin/pytest -n 3 -vvs --test-suite=validate \
+          tests/licensedcode/test_detection_validate.py \
+          -k TestValidateLicenseExtended3
+
+    license_validate_extended_4: |
+        bin/pytest -n 3 -vvs --test-suite=validate \
+          tests/licensedcode/test_detection_validate.py \
+          -k TestValidateLicenseExtended4
+
+    license_validate_extended_5: |
+        bin/pytest -n 3 -vvs --test-suite=validate \
+          tests/licensedcode/test_detection_validate.py \
+          -k TestValidateLicenseExtended5

     license_cache: |
         bin/pytest -n 3 -vvs --test-suite=all \
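The pipeline change above runs the extended validation tests as five separate jobs, each selected with `-k`. The PR does not show how the test data is divided between the five test classes, so the following is only a sketch of one way to do it: split a list of test items into contiguous chunks of near-equal size. The function name `split_in_suites` and the chunking strategy are assumptions for illustration, not the code used in the test module.

```python
def split_in_suites(items, count=5):
    """
    Split `items` into `count` contiguous chunks whose sizes differ by at most
    one. Hypothetical helper: the real test module may divide its data
    differently.
    """
    items = list(items)
    size, extra = divmod(len(items), count)
    suites = []
    start = 0
    for i in range(count):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        suites.append(items[start:end])
        start = end
    return suites


suites = split_in_suites(range(12), count=5)
sizes = [len(s) for s in suites]
```

Each chunk could then back one test class, so that a `-k` expression selects exactly one chunk per CI job.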
168 changes: 168 additions & 0 deletions etc/scripts/licenses/gen_spdx_lists_fp.py
@@ -0,0 +1,168 @@
# -*- coding: utf-8 -*-
#
# Copyright (c) nexB Inc. and others. All rights reserved.
# ScanCode is a trademark of nexB Inc.
# SPDX-License-Identifier: Apache-2.0
# See http://www.apache.org/licenses/LICENSE-2.0 for the license text.
# See https://github.com/nexB/scancode-toolkit for support or download.
# See https://aboutcode.org for more information about nexB OSS projects.
#

import click

from licensedcode.tokenize import ngrams

import synclic

"""
A script to generate false-positive license detection rules from lists of SPDX
licenses.

Common license detection tools embed lists of SPDX license ids to support their
operations. Scanning such a list yields many license matches, and in these
cases all of them are false positives.

Here we fetch the latest released SPDX license list (or the list at a given
commitish) and generate false positives using these approaches to have a
reasonable set of combinations of license ids as found in the wild:

1. From the SPDX license list, we consider these lists:
- all IDs
- all non-deprecated IDs
- all licenses
- all non-deprecated licenses
- all exceptions
- all non-deprecated exceptions

We generate lists of ids only and lists of ids and names.

2. We sort each of these lists in two ways:
- respecting case
- ignoring case

3. For each of these sorted lists, we collect sub-sequences of 6 licenses, one
per line, and generate a false positive RULE from each sub-sequence.

If a RULE already exists, it will be skipped.
"""

TRACE = False

template = '''----------------------------------------
is_false_positive: yes
notes: a sequence of SPDX license ids and names is not a license
---
{}
'''


@click.command()
@click.argument(
    'license_dir', type=click.Path(), metavar='DIR')

@click.argument(
    # 'A buildrules-formatted file used to generate new licenses rules.')
    'output', type=click.Path(), metavar='FILE')

@click.option(
    '--commitish', type=str, default=None,
    help='An optional commitish to use for SPDX license data instead of the latest release.')

@click.option(
    # 'A buildrules-formatted file used to generate new licenses rules.')
    '--from-list', default=None, type=click.Path(), metavar='LIST_FILE',
    help='Use file with a list of entries to ignore instead')

@click.option(
    '-t', '--trace', is_flag=True, default=False,
    help='Print execution trace.')

@click.help_option('-h', '--help')
def cli(license_dir, output, commitish=None, from_list=None, trace=False):
    """
    Generate ScanCode false-positive license detection rules from lists of SPDX
    licenses. Save these in FILE for use with buildrules.

    The `spdx` directory is used as a temp store for fetched SPDX licenses.
    """
    global TRACE
    TRACE = trace

    if not from_list:
        spdx_source = synclic.SpdxSource(external_base_dir=license_dir)

        spdx_by_key = spdx_source.get_licenses(
            commitish=commitish,
            skip_oddities=False,
        )

        all_licenses_and_exceptions = []
        all_licenses_and_exceptions_non_deprecated = []
        licenses = []
        exceptions = []
        licenses_non_deprecated = []
        exceptions_non_deprecated = []

        lists_of_licenses = [
            all_licenses_and_exceptions,
            all_licenses_and_exceptions_non_deprecated,
            licenses,
            exceptions,
            licenses_non_deprecated,
            exceptions_non_deprecated,
        ]

        for lspdx in spdx_by_key.values():
            all_licenses_and_exceptions.append(lspdx)
            is_deprecated = lspdx.is_deprecated
            if not is_deprecated:
                all_licenses_and_exceptions_non_deprecated.append(lspdx)
            if lspdx.is_exception:
                exceptions.append(lspdx)
                if not is_deprecated:
                    exceptions_non_deprecated.append(lspdx)
            else:
                licenses.append(lspdx)
                if not is_deprecated:
                    licenses_non_deprecated.append(lspdx)

        lists_of_sorted_licenses = []
        for lic_list in lists_of_licenses:
            sorted_case_sensitive = sorted(lic_list, key=lambda x: x.spdx_license_key)

            as_ids = [l.spdx_license_key for l in sorted_case_sensitive]
            lists_of_sorted_licenses.append(as_ids)

            as_id_names = [f'{l.spdx_license_key} {l.name}' for l in sorted_case_sensitive]
            lists_of_sorted_licenses.append(as_id_names)

            sorted_case_insensitive = sorted(lic_list, key=lambda x: x.spdx_license_key.lower())
            as_ids = [l.spdx_license_key for l in sorted_case_insensitive]
            lists_of_sorted_licenses.append(as_ids)

            as_id_names = [f'{l.spdx_license_key} {l.name}' for l in sorted_case_insensitive]
            lists_of_sorted_licenses.append(as_id_names)

    else:
        with open(from_list) as inp:
            lists_of_sorted_licenses = [inp.read().splitlines(False)]

    with open(output, 'w') as o:
        for lic_list in lists_of_sorted_licenses:
            write_ngrams(texts=lic_list, output=o)

        o.write('----------------------------------------\n')


def write_ngrams(texts, output, _seen=set(), ngram_length=6):
    """
    Write the texts list as ngrams to the output file-like object.
    The `_seen` default set is shared across calls so that an ngram text
    already written by a previous call is skipped.
    """
    for text in ['\n'.join(ngs) for ngs in ngrams(texts, ngram_length=ngram_length)]:
        if text in _seen:
            continue
        _seen.add(text)
        output.write(template.format(text))


if __name__ == '__main__':
    cli()
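To make the approach of this script concrete, here is a minimal, self-contained sketch of the same idea: sort a list of ids both respecting and ignoring case, slide a window of six ids over each sorted list, and keep each distinct window as one candidate false-positive rule text. The local `ngrams` helper is a stand-in written for this example (the script itself imports `ngrams` from `licensedcode.tokenize`), and `false_positive_texts` and the sample ids are assumptions for illustration only.

```python
def ngrams(items, ngram_length):
    """
    Yield every contiguous window of `ngram_length` items as a tuple.
    Stand-in for licensedcode.tokenize.ngrams, written for this sketch.
    """
    items = list(items)
    for start in range(len(items) - ngram_length + 1):
        yield tuple(items[start:start + ngram_length])


def false_positive_texts(ids, ngram_length=6, seen=None):
    """
    Return rule texts (one id per line) for each window of ids, skipping any
    text already recorded in the `seen` set.
    """
    seen = set() if seen is None else seen
    texts = []
    for window in ngrams(ids, ngram_length):
        text = '\n'.join(window)
        if text in seen:
            continue
        seen.add(text)
        texts.append(text)
    return texts


ids = ['MIT', 'Apache-2.0', 'curl', 'GPL-2.0-only', 'Zlib', 'ISC', '0BSD']

# Uppercase letters sort before lowercase ones when case is respected,
# so the two orderings differ and produce different windows.
case_sensitive = sorted(ids)
case_insensitive = sorted(ids, key=str.lower)

seen = set()
rules = false_positive_texts(case_sensitive, seen=seen)
rules += false_positive_texts(case_insensitive, seen=seen)
```

With seven ids and a window of six, each ordering contributes two windows. Passing the same `seen` set to both calls mirrors the role of the shared `_seen` default in `write_ngrams`.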
41 changes: 29 additions & 12 deletions etc/scripts/licenses/synclic.py
@@ -162,15 +162,15 @@ def __init__(self, external_base_dir):
         if not exists(self.new_dir):
             mkdir(self.new_dir)

-    def get_licenses(self, scancode_licenses, **kwargs):
+    def get_licenses(self, scancode_licenses=None, **kwargs):
         """
         Return a mapping of key -> ScanCode License objects either fetched
         externally or loaded from the existing `self.original_dir`
         """
         print('Fetching and storing external licenses in:', self.original_dir)

         licenses = []
-        for lic, text in self.fetch_licenses(scancode_licenses, **kwargs):
+        for lic, text in self.fetch_licenses(scancode_licenses=scancode_licenses, **kwargs):
             try:
                 with io.open(lic.text_file, 'w', encoding='utf-8')as tf:
                     tf.write(text)
@@ -336,10 +336,19 @@ class SpdxSource(ExternalLicensesSource):
         'notes',
     )

-    def fetch_licenses(self, scancode_licenses, commitish=None, from_repo=SPDX_DEFAULT_REPO):
+    def fetch_licenses(
+        self,
+        scancode_licenses=None,
+        commitish=None,
+        skip_oddities=True,
+        from_repo=SPDX_DEFAULT_REPO,
+    ):
         """
-        Yield License objects fetched from the latest SPDX license list.
-        Use the latest tagged version or the `commitish` is provided.
+        Yield License objects fetched from the latest SPDX license list. Use the
+        latest tagged version or the `commitish` if provided.
+        If skip_oddities is True, some oddities are skipped or handled
+        specially, such as licenses with a trailing + or foreign language
+        licenses.
         """
         if not commitish:
             # get latest tag
@@ -361,32 +370,40 @@ def fetch_licenses(self, scancode_licenses, commitish=None, from_repo=SPDX_DEFAU
                     and ('/json/details/' in path or '/json/exceptions/' in path)):
                     continue
                 if TRACE_FETCH: print('Loading license:', path)
-                if path.endswith('+.json'):
+                if skip_oddities and path.endswith('+.json'):
                     # Skip the old plus licenses. We use them in
                     # ScanCode, but they are deprecated in SPDX.
                     continue
                 details = json.loads(archive.read(path))
-                lic = self.build_license(details, scancode_licenses)
+                lic = self.build_license(
+                    mapping=details,
+                    scancode_licenses=scancode_licenses,
+                    skip_oddities=skip_oddities,
+                )
+
                 if lic:
                     yield lic

-    def build_license(self, mapping, scancode_licenses):
+    def build_license(self, mapping, skip_oddities=True, scancode_licenses=None):
         """
         Return a ScanCode License object built from an SPDX license mapping.
+        If skip_oddities is True, some oddities are skipped or handled
+        specially, such as licenses with a trailing + or foreign language
+        licenses.
         """
         spdx_license_key = mapping.get('licenseId') or mapping.get('licenseExceptionId')
         assert spdx_license_key
         spdx_license_key = spdx_license_key.strip()
         key = spdx_license_key.lower()

         # TODO: Not yet available in ScanCode
-        is_foreign = key in scancode_licenses.non_english_by_spdx_key
-        if is_foreign:
+        is_foreign = scancode_licenses and key in scancode_licenses.non_english_by_spdx_key
+        if skip_oddities and is_foreign:
             if TRACE: print('Skipping NON-english license FOR NOW:', key)
             return

         # these keys have a complicated history
-        if key in set([
+        if skip_oddities and key in set([
             'gpl-1.0', 'gpl-2.0', 'gpl-3.0',
             'lgpl-2.0', 'lgpl-2.1', 'lgpl-3.0',
             'agpl-1.0', 'agpl-2.0', 'agpl-3.0',
@@ -399,7 +416,7 @@ def build_license(self, mapping, scancode_licenses):
             return

         deprecated = mapping.get('isDeprecatedLicenseId', False)
-        if deprecated:
+        if skip_oddities and deprecated:
             # we use concrete keys for some plus/or later versions for
             # simplicity and override SPDX deprecation for these
             if key.endswith('+'):
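The `skip_oddities` flag introduced in this file gates each special-case skip, and `scancode_licenses` becomes optional. The sketch below illustrates only that gating pattern for two of the visible checks (archive paths ending in `+.json` and non-English licenses). The function `keep_license`, its parameters, and the sample paths and keys are hypothetical and are not part of `synclic.py`.

```python
def keep_license(path, key, skip_oddities=True, non_english_keys=None):
    """
    Return True if the license at `path` with `key` should be processed.
    Hypothetical helper that mirrors the gating pattern of the diff: every
    special-case skip applies only when `skip_oddities` is True.
    """
    if skip_oddities and path.endswith('+.json'):
        # Old "plus" licenses are skipped unless oddities are wanted.
        return False

    # `non_english_keys` may be None, as `scancode_licenses` may be in the
    # diff, so check for it before testing membership.
    is_foreign = bool(non_english_keys) and key in non_english_keys
    if skip_oddities and is_foreign:
        return False

    return True


plus_path = 'json/details/GPL-2.0+.json'
default_result = keep_license(plus_path, 'gpl-2.0+')
unfiltered_result = keep_license(plus_path, 'gpl-2.0+', skip_oddities=False)
```

A caller that wants every id in the list, as `gen_spdx_lists_fp.py` does when it passes `skip_oddities=False`, gets the unfiltered behavior, while the default keeps the earlier filtering.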
4 changes: 4 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_741.RULE
@@ -0,0 +1,4 @@
This product includes software developed by
The Apache Software Foundation (http://www.apache.org/).
// NOTICE file corresponding to the section 4d of The Apache License,
// Version 2.0,
7 changes: 7 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_741.yml
@@ -0,0 +1,7 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 100
ignorable_authors:
- The Apache Software Foundation (http://www.apache.org/)
ignorable_urls:
- http://www.apache.org/
1 change: 1 addition & 0 deletions src/licensedcode/data/rules/apache-2.0_762.RULE
@@ -0,0 +1 @@
is Apache-licensed.
3 changes: 3 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_762.yml
@@ -0,0 +1,3 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 95
4 changes: 4 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_768.RULE
@@ -0,0 +1,4 @@
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

Includes software from other Apache Software Foundation projects,
5 changes: 5 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_768.yml
@@ -0,0 +1,5 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 95
ignorable_urls:
- http://www.apache.org/
1 change: 1 addition & 0 deletions src/licensedcode/data/rules/apache-2.0_769.RULE
@@ -0,0 +1 @@
Includes software from other Apache Software Foundation projects,
3 changes: 3 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_769.yml
@@ -0,0 +1,3 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 95
1 change: 1 addition & 0 deletions src/licensedcode/data/rules/apache-2.0_778.RULE
@@ -0,0 +1 @@
Apache-licensed.
3 changes: 3 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_778.yml
@@ -0,0 +1,3 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 95
3 changes: 3 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_779.RULE
@@ -0,0 +1,3 @@
This product includes software developed by
The Apache Software Foundation (http://www.apache.org/).
See the LICENSE.txt for more details.
9 changes: 9 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_779.yml
@@ -0,0 +1,9 @@
license_expression: apache-2.0
is_license_notice: yes
relevance: 95
referenced_filenames:
- LICENSE.txt
ignorable_authors:
- The Apache Software Foundation (http://www.apache.org/)
ignorable_urls:
- http://www.apache.org/
5 changes: 5 additions & 0 deletions src/licensedcode/data/rules/apache-2.0_785.RULE
@@ -0,0 +1,5 @@
License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
Same license text as listed above