Update license detection #2505
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Till Jaeger @LeChasseur Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In particular some functions need all scancode licenses but not always. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And add tests for npm.compute_normalized_license Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These are generated sequences of false positive licenses typically found in license lists and license handling tools. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
- Instead keep them until later stages of license match refinement, otherwise some false positives may be ignored.
- Also streamline debug tracing printouts.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
And update corresponding tests. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Add the "only_known_words: yes" flag to these short GPL rules, which otherwise too often yield spurious false positive detections. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Remove function licensedcode.cache.build_licenses_db as this is only a thin wrapper on licensedcode.models.load_licenses. Also remove lower() in the license URLs check. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Also remove unused MAX_DIST variable reference in refine_matches() Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
In refine_matches(), call filter_if_only_known_words_rule() later in the process to ensure that small contained rules are not left at the end. Also format code. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
We need this so that we can later treat stopwords as if they were unknown words. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Calling tokens_by_line() with location and query_string arguments makes the code clearer and easier to read. Also apply minor formatting. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Run tokens_by_line() first without a stopwords argument. This allows stopword tokens to be included in the token stream and later to be treated as "unknown" tokens. This way the presence of stopwords in a match can impact a license match score, and a license rule with `only_known_words: yes` cannot be matched if there are unknown words, nor if there are stopwords mixed in with its rule words. Also add a simple end-to-end integration test for this. Reported-by: Pierre Tardy <pierre.tardy@renault.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
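The behavior described in this commit can be illustrated with a minimal sketch. This is not ScanCode's code: `STOPWORDS`, `KNOWN_WORDS`, `tokenize()`, `classify()` and `match_rule()` are hypothetical names invented for illustration, and the in-order token matching is a deliberately naive stand-in for real license matching. The sketch only shows the idea that stopwords stay in the query token stream and count as unknown words when a rule is flagged `only_known_words`.

```python
import re

# Hypothetical sample data, not taken from ScanCode.
STOPWORDS = frozenset({"a", "div", "p"})
KNOWN_WORDS = frozenset({"gpl", "3", "0", "license"})


def tokenize(text):
    """Return lowercase word tokens, keeping stopwords in the stream."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(token):
    """Label a query token; stopwords are treated like unknown words."""
    if token in STOPWORDS or token not in KNOWN_WORDS:
        return "unknown"
    return "known"


def match_rule(rule_tokens, query_text, only_known_words=False):
    """
    Return True if all rule tokens occur in order in the query text.
    With only_known_words set, reject the match when any unknown word or
    stopword sits between the first and last matched tokens.
    """
    if not rule_tokens:
        return False
    query = tokenize(query_text)
    positions = []
    start = 0
    for rule_token in rule_tokens:
        try:
            pos = query.index(rule_token, start)
        except ValueError:
            return False
        positions.append(pos)
        start = pos + 1
    if only_known_words:
        matched_range = query[positions[0]:positions[-1] + 1]
        if any(classify(tok) == "unknown" for tok in matched_range):
            return False
    return True
```

With these assumptions, a short rule such as `["gpl", "3", "0"]` still matches the text `GPL 3.0`, but is rejected on `gpl a 3 0` when the flag is set because the stopword `a` falls inside the matched range.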
Use KDE's own license refs. Make these regular licenses, not exceptions. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Weed out some false positive detections Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Correct qualification of bsd-new and bsd-simplified rules that were incorrectly qualified as one or the other. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
These are notices, not texts. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
For spdx_license_key, we use LicenseRef-scancode-*, and we track other LicenseRef keys in other_spdx_license_keys. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Having a single function is nice but means there is an overhead on each call. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
- Add new LicenseMatch.qcontains_stopwords() to check if stopwords are included in the matched range, and use this function in the filter_if_only_known_words_rule() filter.
- Simplify handling of stopwords in matched text processing.
- Use separate index and query tokenizer functions.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
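To make the split between index and query tokenizers concrete, here is a rough sketch under stated assumptions. The names `sketch_index_tokenizer`, `sketch_query_tokenizer` and `SketchMatch` are hypothetical, as is the choice to record stopwords as a mapping from a token position to the count of stopwords that follow it; the actual ScanCode functions and data structures may differ. Only the method name `qcontains_stopwords()` comes from the commit message.

```python
import re
from dataclasses import dataclass, field

# Hypothetical sample stopwords, not taken from ScanCode.
STOPWORDS = frozenset({"a", "div", "p"})
_WORDS = re.compile(r"[a-z0-9]+")


def sketch_index_tokenizer(text):
    """Tokens for indexing rules: stopwords are simply dropped."""
    return [t for t in _WORDS.findall(text.lower()) if t not in STOPWORDS]


def sketch_query_tokenizer(text):
    """
    Tokens for querying: return (tokens, stopwords_after). The tokens
    exclude stopwords; stopwords_after maps the position of a kept token
    to the number of stopwords found right after it. Position -1 means
    "before the first kept token".
    """
    tokens = []
    stopwords_after = {}
    for word in _WORDS.findall(text.lower()):
        if word in STOPWORDS:
            pos = len(tokens) - 1
            stopwords_after[pos] = stopwords_after.get(pos, 0) + 1
        else:
            tokens.append(word)
    return tokens, stopwords_after


@dataclass
class SketchMatch:
    qstart: int  # first matched query position
    qend: int  # last matched query position, inclusive
    stopwords_after: dict = field(default_factory=dict)

    def qcontains_stopwords(self):
        """Return True if a stopword falls strictly inside the matched range."""
        return any(
            self.stopwords_after.get(pos, 0)
            for pos in range(self.qstart, self.qend)
        )
```

In this sketch, stopwords that come before the first matched token or after the last one are ignored, so only stopwords interleaved with the matched words can disqualify a match in a filter such as filter_if_only_known_words_rule().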
Also add new index_tokenizer tests Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Merging this now. There are many new rules, including several generated rules for false positive lists of SPDX license ids often found in the code and test files of license-related packages.
@@ -1,3 +1,5 @@
 license_expression: gpl-3.0
 is_license_tag: yes
-relevance: 100
+relevance: 60
+only_known_words: yes
❤️
:)
This is an omnibus PR for license detection updates for April 2021
Tasks
Run tests locally to check for errors.