
High relevance on some very short (low character count) rules finds lots of GPL false positives #2484

Closed
tardyp opened this issue Apr 9, 2021 · 5 comments


@tardyp
Contributor

tardyp commented Apr 9, 2021

One of our configuration XML files contains snippets like these (lots of them; the number may vary):

               [...]  ((gpl) < "30") [...]

ScanCode finds tons of GPL matches because of these rules:

gpl-3.0_126.RULE: "gpl 30"
gpl-1.0_16.RULE: "gpl 10"
gpl-2.0_693.RULE: "gpl 20"

They all have a relevance of 100, which looks like a lot to me.
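For context, each of these rules pairs a short rule text with a small YAML data file. A sketch of what such a data file looks like, based on the gpl-1.0_16.yml diff shown later in this thread (the exact field set may vary per rule):

```yaml
# Sketch of src/licensedcode/data/rules/gpl-1.0_16.yml
# (fields as shown in the diff later in this thread; other rules may differ)
license_expression: gpl-1.0
is_license_tag: yes
relevance: 100
```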

Would it make sense to decrease the relevance?

Is there a nice way to blacklist these rules?
(For now, I'll just rm them from the data directory...)

@tardyp tardyp added the bug label Apr 9, 2021
pombredanne added a commit that referenced this issue Apr 12, 2021
Add "only_known_words: yes" flag to these short GPL rules that
otherwise too often yield spurious false positive detections.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
In refine_matches(), call filter_if_only_known_words_rule() later in the
process to ensure that small contained rules are not left at the end.

Also format the code.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
We need this so that we can later treat stopwords as if they were unknown
words.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Calling tokens_by_line() with location and query_string arguments makes
the code clearer and easier to read.

Also apply minor formatting fixes.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Run tokens_by_line() first without a stopwords argument.
This allows stopword tokens to be included in the token stream and
later to be treated as "unknown" tokens. This way the presence of
stopwords in a match can impact a license match score, and a license rule
tagged with `only_known_words: yes` is not matched if there are unknown
words or stopwords mixed with its rule words.

Also add a simple end-to-end integration test for this.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

One of our configuration xml contains snippets like these (lots of them, number may vary):
[...]  ((gpl) < "30") [...]

would it make sense to decrease the relevance?

Yes, it makes sense to lower this, likely to something such as 50?

Is there a nice way to blacklist the rule?
(I'll just rm them for now from the data directory...)

Not at the moment. But if you are deleting these, there are other things to consider:

  1. Adding a few is_false_positive: yes rules may be enough to handle these cases. For instance, I recently added several of these in a branch to cope with cases where we detect lists of SPDX license ids in tools with license-related features (such as the npm CLI tool). In many cases that's enough.

  2. We could remove the rules in question as they may be spurious after all.

  3. There should be a way to also limit the issue at hand by tagging a rule with only_known_words: yes, which means that no extra words may occur between the matched words of the rule, including words that do not exist anywhere in any license or rule.
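The only_known_words idea in point 3 can be sketched as a simple post-matching filter (hypothetical function and names for illustration, not ScanCode's actual filter_if_only_known_words_rule() implementation):

```python
def filter_only_known_words(match_tokens, known_words, only_known_words=True):
    """
    Return True if a match should be kept. For a rule tagged
    ``only_known_words: yes``, reject the match when any token inside
    the matched span is not a known license/rule word.
    Hypothetical sketch, not ScanCode's real code.
    """
    if not only_known_words:
        return True
    return all(tok in known_words for tok in match_tokens)

known = {"gpl", "license", "version"}
# "foo" is an unknown word inside the matched span: the match is dropped.
print(filter_only_known_words(["gpl", "foo", "version"], known))  # False
print(filter_only_known_words(["gpl", "version"], known))  # True
```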

Here we have a bug when this is used: lt and quot are special STOPWORDS that are dropped entirely from the stream of processed tokens. E.g. if I apply this patch, the detection still returns the spurious matches when it should not.

$ git diff --text src/licensedcode/data/rules/
diff --git a/src/licensedcode/data/rules/gpl-1.0_16.yml b/src/licensedcode/data/rules/gpl-1.0_16.yml
index 2348c02df..19fd52fb6 100644
--- a/src/licensedcode/data/rules/gpl-1.0_16.yml
+++ b/src/licensedcode/data/rules/gpl-1.0_16.yml
@@ -1,3 +1,4 @@
 license_expression: gpl-1.0
 is_license_tag: yes
 relevance: 100
+only_known_words: yes

and this file:

$ cat foo
(gpl) &lt; &quot;10&quot;) [...
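To see why the patch alone does not help, here is a minimal sketch of the stopword problem (a simplified stand-in for ScanCode's tokenizer, not its real code): because lt and quot are dropped as stopwords, the snippet collapses to exactly the rule's tokens.

```python
import re

# Hypothetical simplification for illustration: "lt" and "quot"
# (left over from HTML entities) are stopwords that are dropped
# entirely from the token stream.
STOPWORDS = {"lt", "gt", "amp", "quot"}

def tokenize(text, stopwords=STOPWORDS):
    # Lowercase, split on non-alphanumerics, then drop stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

snippet = '(gpl) &lt; &quot;10&quot;) [...'  # the file from the report
rule_text = "gpl 10"                          # text of gpl-1.0_16.RULE

# With stopwords removed, the snippet collapses to exactly the rule
# tokens, so the rule matches even though the text is unrelated to GPL:
print(tokenize(snippet) == tokenize(rule_text))  # True
```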

So let me push something so that we can get a working and cleaner solution for a start.

Also, do you have some common text patterns around these false matches that you can share, so that we could consider them as is_false_positive rules?

I reckon is_false_positive is a bit contrived, but when it comes to the GPL, I feel it is best to cast a wide net and filter out false positives rather than miss matches entirely with false negatives. This approach is based in part on the experience of the extensive scanning we did to help the Linux kernel maintainers clean up the kernel licensing a while back.


@tardyp
Contributor Author

tardyp commented Apr 12, 2021

hi @pombredanne
Thanks for the detailed explanation. Happy that my report uncovered two false positive bugs!
I had indeed wondered why the detection score was not decreased by the distance between two tokens.

60 looks good. I get the idea that there is a need for a mode where ScanCode detects as much as possible.

Not sure that we want to add a false positive detection for this file type.

It is very specific to my company. I am not even sure what this formula is about, but I think it might be GPL, as in Gaz de Pétrole Liquéfié (LPG in English).

Thanks for the fixes; I think they will clearly work for us.

@pombredanne
Member

pombredanne commented Apr 12, 2021

I have always feared that we would someday get some false positives for GPL, either from LPG gas or from this https://gplinc.com ... the day has come!
So even if you think this is silly, adding a few false positive rules is welcome.

@pombredanne
Member

@tardyp At this stage this has been merged in the develop branch. It took a bit longer as there was an unrelated bug discovered as a side effect, concerning how and when false positive rules were filtered, the impact of "stop words" on the overall tokenization and score of a matched license, and how rules that should be matched verbatim were handled.
Thanks for the report! I am closing this now.
