
High relevance on some very short (low character count) rules finds lots of GPL false positives #2484

Closed
tardyp opened this issue Apr 9, 2021 · 5 comments


@tardyp
Contributor

tardyp commented Apr 9, 2021

One of our configuration XML files contains snippets like these (lots of them; the number may vary):

               [...]  ((gpl) < "30") [...]

ScanCode finds tons of GPL matches because of these rules:

gpl-3.0_126.RULE: "gpl 30"
gpl-1.0_16.RULE: "gpl 10"
gpl-2.0_693.RULE: "gpl 20"

They all have a relevance of 100, which looks like a lot to me.
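For context, each of these rules pairs a short rule text with a small YAML data file. A sketch of what such a data file looks like, based on the gpl-1.0_16.yml diff shown later in this thread (the exact field set may vary per rule):

```yaml
# Sketch of src/licensedcode/data/rules/gpl-1.0_16.yml
# (fields as shown in the diff later in this thread; other rules may differ)
license_expression: gpl-1.0
is_license_tag: yes
relevance: 100
```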

Would it make sense to decrease the relevance?

Is there a nice way to blacklist these rules?
(For now, I'll just rm them from the data directory...)

@tardyp tardyp added the bug label Apr 9, 2021
pombredanne added a commit that referenced this issue Apr 12, 2021
Add "only_known_words: yes" flag to these short GPL rules that
otherwise too often yield spurious false positive detections.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
In refine_matches(), call filter_if_only_known_words_rule() later in the
process to ensure that small contained rules are not left at the end.

Also format the code.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
We need this so that we can later treat stopwords as if they were unknown
words.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Calling tokens_by_line() with location and query_string arguments makes
the code clearer and easier to read.

Also apply minor formatting fixes.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Apr 12, 2021
Run tokens_by_line() first without a stopwords argument.
This allows stopword tokens to be included in the token stream and
later to be treated as "unknown" tokens. This way the presence of
stopwords in a match can impact a license match score, and a license rule
tagged with `only_known_words: yes` is not matched if there are unknown
words or stopwords mixed with its rule words.

Also add a simple end-to-end integration test for this.

Reported-by: Pierre Tardy <pierre.tardy@renault.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

One of our configuration xml contains snippets like these (lots of them, number may vary):
[...]  ((gpl) < "30") [...]

would it make sense to decrease the relevance?

Yes, it makes sense to lower this, likely to something such as 50?

Is there a nice way to blacklist the rule?
(I'll just rm them for now from the data directory...)

Not at the moment. But if you are deleting these, there are other things to consider:

  1. Adding a few is_false_positive: yes rules may be enough to handle these cases. For instance, I recently added several of these in a branch to cope with cases where we detect lists of SPDX license ids in tools with license-related features (such as the npm CLI tool). In many cases that's enough.

  2. We could remove the rules in question as they may be spurious after all.

  3. There should be a way to also limit the issue at hand by tagging a rule with only_known_words: yes, which means that no extra words may occur between the matched words of the rule, including words that do not exist anywhere in any license or rule.
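The only_known_words idea in point 3 can be sketched as a simple post-matching filter (hypothetical function and names for illustration, not ScanCode's actual filter_if_only_known_words_rule() implementation):

```python
def filter_only_known_words(match_tokens, known_words, only_known_words=True):
    """
    Return True if a match should be kept. For a rule tagged
    ``only_known_words: yes``, reject the match when any token inside
    the matched span is not a known license/rule word.
    Hypothetical sketch, not ScanCode's real code.
    """
    if not only_known_words:
        return True
    return all(tok in known_words for tok in match_tokens)

known = {"gpl", "license", "version"}
# "foo" is an unknown word inside the matched span: the match is dropped.
print(filter_only_known_words(["gpl", "foo", "version"], known))  # False
print(filter_only_known_words(["gpl", "version"], known))  # True
```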

Here we have a bug when this is used: lt and quot are special STOPWORDS that are dropped entirely from the stream of processed tokens. E.g. if I apply this patch, the detection still returns the spurious matches when it should not.

$ git diff --text src/licensedcode/data/rules/
diff --git a/src/licensedcode/data/rules/gpl-1.0_16.yml b/src/licensedcode/data/rules/gpl-1.0_16.yml
index 2348c02df..19fd52fb6 100644
--- a/src/licensedcode/data/rules/gpl-1.0_16.yml
+++ b/src/licensedcode/data/rules/gpl-1.0_16.yml
@@ -1,3 +1,4 @@
 license_expression: gpl-1.0
 is_license_tag: yes
 relevance: 100
+only_known_words: yes

and this file:

$ cat foo
(gpl) &lt; &quot;10&quot;) [...
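To see why the patch alone does not help, here is a minimal sketch of the stopword problem (a simplified stand-in for ScanCode's tokenizer, not its real code): because lt and quot are dropped as stopwords, the snippet collapses to exactly the rule's tokens.

```python
import re

# Hypothetical simplification for illustration: "lt" and "quot"
# (left over from HTML entities) are stopwords that are dropped
# entirely from the token stream.
STOPWORDS = {"lt", "gt", "amp", "quot"}

def tokenize(text, stopwords=STOPWORDS):
    # Lowercase, split on non-alphanumerics, then drop stopwords.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]

snippet = '(gpl) &lt; &quot;10&quot;) [...'  # the file from the report
rule_text = "gpl 10"                          # text of gpl-1.0_16.RULE

# With stopwords removed, the snippet collapses to exactly the rule
# tokens, so the rule matches even though the text is unrelated to GPL:
print(tokenize(snippet) == tokenize(rule_text))  # True
```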

So let me push something so that we can get a working and cleaner solution for a start.

Also, do you have some common text patterns around these false matches that you can share, so that we could consider them as is_false_positive rules?

I reckon is_false_positive is a bit contrived, but when it comes to the GPL, I feel it is best to cast a wide net and filter out false positives rather than miss matches entirely with false negatives. This approach is based in part on the experience of the extensive scanning we did to help the Linux kernel maintainers clean up the kernel licensing a while back.


@tardyp
Contributor Author

tardyp commented Apr 12, 2021

hi @pombredanne
Thanks for the detailed explanation. Happy that my report uncovered two false positive bugs!
I had indeed wondered why the detection score was not decreased by the distance between two tokens.

60 looks good. I get the idea that there is a need for a mode where ScanCode detects as much as possible.

Not sure that we want to add a false positive detection for this file type.

It is very specific to my company. I am not even sure what this formula is about, but I think it might be GPL, as in Gaz de Pétrole Liquéfié (LPG in English).

Thanks for the fixes; I think they will clearly work for us.

@pombredanne
Member

pombredanne commented Apr 12, 2021

I have always feared that we would someday get some false positives for GPL, either from LPG gas or from this https://gplinc.com ... the day has come!
So even if you think this is silly, adding a few false positive rules is welcome.

@pombredanne
Member

@tardyp At this stage this has been merged in the develop branch. It took a bit longer as there was an unrelated bug discovered as a side effect, concerning how and when false positive rules were filtered, the impact of "stop words" on the overall tokenization and score of a matched license, and how rules that should be matched verbatim were handled.
Thanks for the report! I am closing this now.
