Update license detection #2505

Merged
merged 57 commits into from
Apr 23, 2021

pombredanne
Member

@pombredanne pombredanne commented Apr 22, 2021

This is an omnibus PR for license detection updates for April 2021

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Till Jaeger @LeChasseur
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In particular, some functions need all ScanCode licenses,
but not always.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And add tests for npm.compute_normalized_license

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These are generated sequences of false positive licenses typically
found in license lists and license handling tools.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
- Instead, keep them until the later stages of license match refinement;
  otherwise some false positives may be ignored.

- Also streamline debug tracing printouts

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And update corresponding tests

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Add the "only_known_words: yes" flag to these short GPL rules, which
otherwise too often produce spurious false positive detections.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Remove function licensedcode.cache.build_licenses_db as this is only a
thin wrapper around licensedcode.models.load_licenses

Also remove lower() in license URLs check

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Also remove unused MAX_DIST variable reference in refine_matches()

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In refine_matches() call filter_if_only_known_words_rule() later in the
process to ensure that small contained rules are not left at the end.

Also format code

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We need this so that we can later treat stopwords as if they were unknown
words.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Calling tokens_by_line() with location and query_string arguments makes
the code clearer and easier to read.

Also apply minor formatting

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Run tokens_by_line() first without a stopwords argument.
This allows stopword tokens to be included in the token stream and
later to be treated as "unknown" tokens. This way, the presence of
stopwords in a match can affect a license match score, and a license rule
with `only_known_words: yes` cannot be matched if there are unknown
words or if there are stopwords mixed in with its rule words.

Also add a simple end-to-end integration test for this.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
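
The stopword handling described in the commit message above can be sketched roughly as follows. This is a minimal, standalone illustration and not the ScanCode implementation: the `STOPWORDS` and `RULE_WORDS` sets, the `query_tokens()` helper and the `can_match_only_known_words_rule()` check are all hypothetical names made up for this example. The point is only that stopwords stay in the token stream and are then treated like unknown words.

```python
import re

# Hypothetical, tiny stopword set and rule vocabulary for illustration only.
# These are NOT the real ScanCode data structures.
STOPWORDS = frozenset({"a", "the", "of"})
RULE_WORDS = frozenset({"gpl", "v3", "license"})

# Split text into lowercase alphanumeric words.
word_splitter = re.compile(r"[a-z0-9]+").findall


def query_tokens(text):
    """Yield (token, kind) pairs, keeping stopwords in the token stream."""
    for token in word_splitter(text.lower()):
        if token in STOPWORDS:
            yield token, "stopword"
        elif token in RULE_WORDS:
            yield token, "known"
        else:
            yield token, "unknown"


def can_match_only_known_words_rule(text):
    """
    Return True if every token of `text` is a known rule word.
    Stopwords count against the match, exactly like unknown words.
    """
    return all(kind == "known" for _token, kind in query_tokens(text))
```

With these assumed sets, `can_match_only_known_words_rule("GPL v3 license")` holds, while a text such as "GPL of the v3" is rejected because stopwords are mixed in with the rule words.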
Use KDE's own license refs.
Make these regular licenses, not exceptions.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Weed out some false positive detections

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Correct qualification of bsd-new and bsd-simplified rules that were
incorrectly qualified as one or the other.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These are notices, not texts

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
For spdx_license_key, we use LicenseRef-scancode-*,
and we track other LicenseRef keys in other_spdx_license_keys

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
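
As an illustration of the key convention in the commit message above, here is a small sketch that splits a list of LicenseRef identifiers into a primary key in the `LicenseRef-scancode-` namespace and a list of other keys. The `split_license_refs()` helper and the example identifiers are hypothetical and only mirror the stated convention; the sketch does not cover identifiers from the SPDX license list and is not how ScanCode stores or validates these fields.

```python
SCANCODE_NAMESPACE = "LicenseRef-scancode-"


def split_license_refs(license_refs):
    """
    Return a (spdx_license_key, other_spdx_license_keys) tuple.
    The first LicenseRef in the ScanCode namespace becomes the primary key;
    every other LicenseRef is tracked in the list of other keys.
    """
    primary = None
    others = []
    for ref in license_refs:
        if primary is None and ref.startswith(SCANCODE_NAMESPACE):
            primary = ref
        else:
            others.append(ref)
    return primary, others


# Hypothetical identifiers, for illustration only.
spdx_license_key, other_spdx_license_keys = split_license_refs(
    ["LicenseRef-example-foo", "LicenseRef-scancode-foo"]
)
```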
Having a single function is nice but means there is an overhead on each
call.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
- Add a new LicenseMatch.qcontains_stopwords() method to check whether
  stopwords are included in the matched range, and use it in the
  filter_if_only_known_words_rule() filter.
- Simplify handling of stopwords in matched text processing.
- Use separate index and query tokenizer functions

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
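
The check named in the commit message above could look roughly like the sketch below. It is an assumption-laden illustration, not the actual `LicenseMatch` code: `MatchSketch`, its `qstart`/`qend` fields, the `stopword_positions` argument and the `filter_matches_with_stopwords()` helper are hypothetical stand-ins, and the real method signature may differ.

```python
from dataclasses import dataclass


@dataclass
class MatchSketch:
    """Hypothetical stand-in for a license match over an inclusive query token range."""
    qstart: int
    qend: int
    only_known_words: bool = True

    def qcontains_stopwords(self, stopword_positions):
        """Return True if any stopword position falls within qstart..qend."""
        return any(self.qstart <= pos <= self.qend for pos in stopword_positions)


def filter_matches_with_stopwords(matches, stopword_positions):
    """
    Return a (kept, discarded) tuple of match lists.
    A match for an only_known_words rule is discarded when its matched
    range contains at least one stopword.
    """
    kept, discarded = [], []
    for match in matches:
        if match.only_known_words and match.qcontains_stopwords(stopword_positions):
            discarded.append(match)
        else:
            kept.append(match)
    return kept, discarded
```

For example, with a stopword at query position 3, a match spanning positions 2 to 5 would be discarded while a match spanning positions 0 to 1 would be kept.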
Also add new index_tokenizer tests

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member Author

Merging this now. There are many new rules, including several generated rules for false positive lists of SPDX license ids that are often found in the code and test files of license-related packages.

@pombredanne pombredanne merged commit bb04420 into develop Apr 23, 2021
@pombredanne pombredanne deleted the 2021-04-license-updates branch April 23, 2021 06:41
@@ -1,3 +1,5 @@
 license_expression: gpl-3.0
 is_license_tag: yes
-relevance: 100
+relevance: 60
+only_known_words: yes

❤️

Member Author

:)
